
[v5] blk-mq: fix start_time_ns and alloc_time_ns for pre-allocated rq

Message ID 20230710105516.2053478-1-chengming.zhou@linux.dev (mailing list archive)
State New, archived
Series [v5] blk-mq: fix start_time_ns and alloc_time_ns for pre-allocated rq

Commit Message

Chengming Zhou July 10, 2023, 10:55 a.m. UTC
From: Chengming Zhou <zhouchengming@bytedance.com>

iocost relies on rq start_time_ns and alloc_time_ns to tell the saturation
state of the block device. Most of the time the request is allocated after
rq_qos_throttle() and its alloc_time_ns or start_time_ns won't be affected.

But for the plug batched allocation introduced by commit 47c122e35d7e
("block: pre-allocate requests if plug is started and is a batch"),
rq_qos_throttle() can run after the allocation of the request. This is what
blk_mq_get_cached_request() does.

In this case, the cached request's alloc_time_ns or start_time_ns is far in
the past if we blocked in any qos ->throttle().

Fix it by setting alloc_time_ns and start_time_ns to now when the allocated
request is actually used.

Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
---
v5:
 - Don't reuse __blk_mq_alloc_requests(), which would make it much more complex.
 - Always initialize the time stamps after blk_mq_rq_ctx_init(), which
   is clearer, as suggested by Christoph Hellwig.
 - We still need to pass the optional alloc_time_ns to blk_mq_rq_time_init(),
   since alloc_time_ns by definition includes depth and tag wait time.
 - [v4] https://lore.kernel.org/all/20230629121302.1124851-1-chengming.zhou@linux.dev/

v4:
 - Use blk_mq_alloc_data to pass start_time_ns instead of passing down
   yet another parameter. Thanks Christoph Hellwig.
 - [v3] https://lore.kernel.org/all/20230628124546.1056698-1-chengming.zhou@linux.dev/

v3:
 - Skip setting the alloc_time_ns and start_time_ns during pre-allocation,
   which is clearer, as suggested by Tejun.
 - [v2] https://lore.kernel.org/all/20230626050405.781253-1-chengming.zhou@linux.dev/
---
 block/blk-mq.c | 47 ++++++++++++++++++++++++++++++-----------------
 1 file changed, 30 insertions(+), 17 deletions(-)

Comments

Christoph Hellwig July 10, 2023, 12:11 p.m. UTC | #1
As a personal preference I kinda hate the empty ?:, but otherwise this
looks good;

Reviewed-by: Christoph Hellwig <hch@lst.de>
Tejun Heo July 10, 2023, 7:42 p.m. UTC | #2
On Mon, Jul 10, 2023 at 06:55:16PM +0800, chengming.zhou@linux.dev wrote:
> From: Chengming Zhou <zhouchengming@bytedance.com>
> 
> The iocost rely on rq start_time_ns and alloc_time_ns to tell saturation
> state of the block device. Most of the time request is allocated after
> rq_qos_throttle() and its alloc_time_ns or start_time_ns won't be affected.
> 
> But for plug batched allocation introduced by the commit 47c122e35d7e
> ("block: pre-allocate requests if plug is started and is a batch"), we can
> rq_qos_throttle() after the allocation of the request. This is what the
> blk_mq_get_cached_request() does.
> 
> In this case, the cached request alloc_time_ns or start_time_ns is much
> ahead if blocked in any qos ->throttle().
> 
> Fix it by setting alloc_time_ns and start_time_ns to now when the allocated
> request is actually used.
> 
> Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>

Acked-by: Tejun Heo <tj@kernel.org>

Thanks.
Jens Axboe July 10, 2023, 7:47 p.m. UTC | #3
On 7/10/23 4:55 AM, chengming.zhou@linux.dev wrote:
> From: Chengming Zhou <zhouchengming@bytedance.com>
> 
> The iocost rely on rq start_time_ns and alloc_time_ns to tell
> saturation state of the block device. Most of the time request is
> allocated after rq_qos_throttle() and its alloc_time_ns or
> start_time_ns won't be affected.
> 
> But for plug batched allocation introduced by the commit 47c122e35d7e
> ("block: pre-allocate requests if plug is started and is a batch"), we
> can rq_qos_throttle() after the allocation of the request. This is
> what the blk_mq_get_cached_request() does.
> 
> In this case, the cached request alloc_time_ns or start_time_ns is
> much ahead if blocked in any qos ->throttle().
> 
> Fix it by setting alloc_time_ns and start_time_ns to now when the
> allocated request is actually used.

One of the single most costly things we do in the io path is the crazy
amount of time stamping that's done or added without attention to what
else is doing it or where. And this is why we have a ton of them, and
why the batched time stamping made sense.

I'd much rather see this just get the saved timestamp updated when
necessary. If we block, that's certainly one such case.
Chengming Zhou July 13, 2023, 12:25 p.m. UTC | #4
On 2023/7/11 03:47, Jens Axboe wrote:
> On 7/10/23 4:55 AM, chengming.zhou@linux.dev wrote:
>> From: Chengming Zhou <zhouchengming@bytedance.com>
>>
>> The iocost rely on rq start_time_ns and alloc_time_ns to tell
>> saturation state of the block device. Most of the time request is
>> allocated after rq_qos_throttle() and its alloc_time_ns or
>> start_time_ns won't be affected.
>>
>> But for plug batched allocation introduced by the commit 47c122e35d7e
>> ("block: pre-allocate requests if plug is started and is a batch"), we
>> can rq_qos_throttle() after the allocation of the request. This is
>> what the blk_mq_get_cached_request() does.
>>
>> In this case, the cached request alloc_time_ns or start_time_ns is
>> much ahead if blocked in any qos ->throttle().
>>
>> Fix it by setting alloc_time_ns and start_time_ns to now when the
>> allocated request is actually used.
> 
> One of the single most costly things we do in the io path is the crazy
> amount of time stamping that's done or added without attention to what
> else is doing it or where. And this is why we have a ton of them, and
> why the batched time stamping made sense.
> 
> I'd much rather see this just get the saved timestamp updated when
> necessary. If we block, that's certainly one such case.
> 

OK, this version only gets the time stamp once per request, which is actually
not worse than the current code, which gets the start time stamp once for each
request even in batch allocation.

But yes, maybe we can also set the start time stamp in batch mode and only
update it in the blocking case, which, as you said, has better performance.

The first version [1] I posted actually did just this, using a nr_flush counter
in the plug to indicate that we blocked and flushed the plug. Tejun and I thought
it seemed fragile, so I went with this approach of setting the time stamp only
once, when the request is actually used.

[1] https://lore.kernel.org/all/20230601053919.3639954-1-chengming.zhou@linux.dev/

Another way I can think of is to make rq_qos_throttle() return a bool to indicate
whether it blocked. Tejun and Jens, what do you think of this approach?

Although it has better performance, in case of preemption the time stamp may not be accurate.

Thanks.
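
For illustration, a rough sketch of what that alternative could look like. It is hypothetical: in mainline rq_qos_throttle() returns void, and the merged fix below re-stamps the cached request unconditionally.

	/*
	 * Hypothetical sketch, not the merged fix: batch pre-allocation would
	 * keep setting the initial time stamps, and rq_qos_throttle() would be
	 * changed to return true when any ->throttle() handler actually slept,
	 * so the cached request is re-stamped only when its stamps really went
	 * stale.  Tail of blk_mq_get_cached_request():
	 */
	plug->cached_rq = rq_list_next(rq);
	if (rq_qos_throttle(q, *bio))		/* hypothetical bool return */
		blk_mq_rq_time_init(rq, 0);	/* refresh only after blocking */

	rq->cmd_flags = (*bio)->bi_opf;
	INIT_LIST_HEAD(&rq->queuelist);
	return rq;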
Tejun Heo July 13, 2023, 5:58 p.m. UTC | #5
Hello,

On Thu, Jul 13, 2023 at 08:25:50PM +0800, Chengming Zhou wrote:
> Ok, this version will only get time stamp once for one request, it's actually
> not worse than the current code, which will get start time stamp once for each
> request even in the batch allocation.
> 
> But yes, maybe we can also set the start time stamp in the batch mode, and only
> update the time stamp in the block case, like you said, has better performance.
> 
> The first version [1] I posted actually just did this, in which use a nr_flush counter
> in plug to indicate that we blocked & flushed plug. Tejun and I think it seems fragile.
> So go to this way that only set time stamp once when the request actually used.
> 
> [1] https://lore.kernel.org/all/20230601053919.3639954-1-chengming.zhou@linux.dev/
> 
> Another way I can think of is to make rq_qos_throttle() return a bool to indicate
> if it blocked. Tejun and Jens, how do you think about this way?
> 
> Although it's better performance, in case of preemption, the time stamp maybe not accurate.

Trying to manually optimize timestamp reads seems like a bit of a fool's
errand to me. I don't think anyone cares about nanosec accuracy, so there
are ample opportunities for generically caching timestamps so that we don't
have to contort code to optimize timestamp calls.

It's a bit out of scope for this patchset but I think it might make sense to
build a timestamp caching infrastructure. The cached timestamp can be
invalidated on context switches (block layer already hooks into them) and
issue and other path boundaries (e.g. at the end of plug flush).

Thanks.
Jens Axboe July 13, 2023, 6:13 p.m. UTC | #6
On 7/13/23 11:58 AM, Tejun Heo wrote:
> It's a bit out of scope for this patchset but I think it might make
> sense to build a timestamp caching infrastructure. The cached
> timestamp can be invalidated on context switches (block layer already
> hooks into them) and issue and other path boundaries (e.g. at the end
> of plug flush).

This is a great idea! Have the plug init start with the timestamp
invalid, and use some blk_get_time() helpers that return the current time
when there is no plug, and set it in the plug if not already set. Flushing
the plug would mark it invalid again.

This obviously won't help the non-plugging cases, but those are not that
interesting in comparison.
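
For reference, a minimal sketch of such helpers, assuming a hypothetical cur_ktime field in struct blk_plug and a hypothetical blk_plug_invalidate_ts() helper (neither exists at the time of this thread); it only illustrates the idea outlined above.

/* Sketch only: cache a coarse timestamp in the current task's plug. */
static inline u64 blk_get_time(void)
{
	struct blk_plug *plug = current->plug;

	if (!plug)
		return ktime_get_ns();	/* no plug: read the clock directly */

	if (!plug->cur_ktime)		/* hypothetical cached-timestamp field */
		plug->cur_ktime = ktime_get_ns();
	return plug->cur_ktime;
}

/* Invalidate at plug flush (and, per Tejun, on context switch). */
static inline void blk_plug_invalidate_ts(struct blk_plug *plug)
{
	plug->cur_ktime = 0;
}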
Jens Axboe July 13, 2023, 6:31 p.m. UTC | #7
On Mon, 10 Jul 2023 18:55:16 +0800, chengming.zhou@linux.dev wrote:
> The iocost rely on rq start_time_ns and alloc_time_ns to tell saturation
> state of the block device. Most of the time request is allocated after
> rq_qos_throttle() and its alloc_time_ns or start_time_ns won't be affected.
> 
> But for plug batched allocation introduced by the commit 47c122e35d7e
> ("block: pre-allocate requests if plug is started and is a batch"), we can
> rq_qos_throttle() after the allocation of the request. This is what the
> blk_mq_get_cached_request() does.
> 
> [...]

Applied, thanks!

[1/1] blk-mq: fix start_time_ns and alloc_time_ns for pre-allocated rq
      commit: 5c17f45e91f5035c1b317e93b3dfb01088ac2902

Best regards,
Chengming Zhou July 14, 2023, 11:31 a.m. UTC | #8
On 2023/7/14 01:58, Tejun Heo wrote:
> Hello,
> 
> On Thu, Jul 13, 2023 at 08:25:50PM +0800, Chengming Zhou wrote:
>> Ok, this version will only get time stamp once for one request, it's actually
>> not worse than the current code, which will get start time stamp once for each
>> request even in the batch allocation.
>>
>> But yes, maybe we can also set the start time stamp in the batch mode, and only
>> update the time stamp in the block case, like you said, has better performance.
>>
>> The first version [1] I posted actually just did this, in which use a nr_flush counter
>> in plug to indicate that we blocked & flushed plug. Tejun and I think it seems fragile.
>> So go to this way that only set time stamp once when the request actually used.
>>
>> [1] https://lore.kernel.org/all/20230601053919.3639954-1-chengming.zhou@linux.dev/
>>
>> Another way I can think of is to make rq_qos_throttle() return a bool to indicate
>> if it blocked. Tejun and Jens, how do you think about this way?
>>
>> Although it's better performance, in case of preemption, the time stamp maybe not accurate.
> 
> Trying to manually optimized timestamp reads seems like a bit of fool's
> errand to me. I don't think anyone cares about nanosec accuracy, so there
> are ample opportunities for generically caching timestamp so that we don't
> have to contort code to optimzie timestamp calls.
> 
> It's a bit out of scope for this patchset but I think it might make sense to
> build a timestamp caching infrastructure. The cached timestamp can be
> invalidated on context switches (block layer already hooks into them) and
> issue and other path boundaries (e.g. at the end of plug flush).
> 

Yes, this is a really great idea. It has better performance and is more generic.

Thanks.
Jens Axboe July 14, 2023, 2:43 p.m. UTC | #9
On 7/14/23 5:31 AM, Chengming Zhou wrote:
> On 2023/7/14 01:58, Tejun Heo wrote:
>> Hello,
>>
>> On Thu, Jul 13, 2023 at 08:25:50PM +0800, Chengming Zhou wrote:
>>> Ok, this version will only get time stamp once for one request, it's actually
>>> not worse than the current code, which will get start time stamp once for each
>>> request even in the batch allocation.
>>>
>>> But yes, maybe we can also set the start time stamp in the batch mode, and only
>>> update the time stamp in the block case, like you said, has better performance.
>>>
>>> The first version [1] I posted actually just did this, in which use a nr_flush counter
>>> in plug to indicate that we blocked & flushed plug. Tejun and I think it seems fragile.
>>> So go to this way that only set time stamp once when the request actually used.
>>>
>>> [1] https://lore.kernel.org/all/20230601053919.3639954-1-chengming.zhou@linux.dev/
>>>
>>> Another way I can think of is to make rq_qos_throttle() return a bool to indicate
>>> if it blocked. Tejun and Jens, how do you think about this way?
>>>
>>> Although it's better performance, in case of preemption, the time stamp maybe not accurate.
>>
>> Trying to manually optimized timestamp reads seems like a bit of fool's
>> errand to me. I don't think anyone cares about nanosec accuracy, so there
>> are ample opportunities for generically caching timestamp so that we don't
>> have to contort code to optimzie timestamp calls.
>>
>> It's a bit out of scope for this patchset but I think it might make sense to
>> build a timestamp caching infrastructure. The cached timestamp can be
>> invalidated on context switches (block layer already hooks into them) and
>> issue and other path boundaries (e.g. at the end of plug flush).
>>
> 
> Yes, this is a really great idea. It has better performance and is
> more generic.

Do you want to work on that approach? I pretty much outlined how I think
it'd work in the previous reply.
Chengming Zhou July 14, 2023, 2:49 p.m. UTC | #10
On 2023/7/14 22:43, Jens Axboe wrote:
> On 7/14/23 5:31 AM, Chengming Zhou wrote:
>> On 2023/7/14 01:58, Tejun Heo wrote:
>>> Hello,
>>>
>>> On Thu, Jul 13, 2023 at 08:25:50PM +0800, Chengming Zhou wrote:
>>>> Ok, this version will only get time stamp once for one request, it's actually
>>>> not worse than the current code, which will get start time stamp once for each
>>>> request even in the batch allocation.
>>>>
>>>> But yes, maybe we can also set the start time stamp in the batch mode, and only
>>>> update the time stamp in the block case, like you said, has better performance.
>>>>
>>>> The first version [1] I posted actually just did this, in which use a nr_flush counter
>>>> in plug to indicate that we blocked & flushed plug. Tejun and I think it seems fragile.
>>>> So go to this way that only set time stamp once when the request actually used.
>>>>
>>>> [1] https://lore.kernel.org/all/20230601053919.3639954-1-chengming.zhou@linux.dev/
>>>>
>>>> Another way I can think of is to make rq_qos_throttle() return a bool to indicate
>>>> if it blocked. Tejun and Jens, how do you think about this way?
>>>>
>>>> Although it's better performance, in case of preemption, the time stamp maybe not accurate.
>>>
>>> Trying to manually optimized timestamp reads seems like a bit of fool's
>>> errand to me. I don't think anyone cares about nanosec accuracy, so there
>>> are ample opportunities for generically caching timestamp so that we don't
>>> have to contort code to optimzie timestamp calls.
>>>
>>> It's a bit out of scope for this patchset but I think it might make sense to
>>> build a timestamp caching infrastructure. The cached timestamp can be
>>> invalidated on context switches (block layer already hooks into them) and
>>> issue and other path boundaries (e.g. at the end of plug flush).
>>>
>>
>> Yes, this is a really great idea. It has better performance and is
>> more generic.
> 
> Do you want to work on that approach? I pretty much outlined how I think
> it'd work in the previous reply.
> 

OK, I'd like to do it. Your outline is clear; I will implement and test it.

Thanks!

Patch

diff --git a/block/blk-mq.c b/block/blk-mq.c
index decb6ab2d508..2352fed460f8 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -337,8 +337,24 @@  void blk_rq_init(struct request_queue *q, struct request *rq)
 }
 EXPORT_SYMBOL(blk_rq_init);
 
+/* Set start and alloc time when the allocated request is actually used */
+static inline void blk_mq_rq_time_init(struct request *rq, u64 alloc_time_ns)
+{
+	if (blk_mq_need_time_stamp(rq))
+		rq->start_time_ns = ktime_get_ns();
+	else
+		rq->start_time_ns = 0;
+
+#ifdef CONFIG_BLK_RQ_ALLOC_TIME
+	if (blk_queue_rq_alloc_time(rq->q))
+		rq->alloc_time_ns = alloc_time_ns ?: rq->start_time_ns;
+	else
+		rq->alloc_time_ns = 0;
+#endif
+}
+
 static struct request *blk_mq_rq_ctx_init(struct blk_mq_alloc_data *data,
-		struct blk_mq_tags *tags, unsigned int tag, u64 alloc_time_ns)
+		struct blk_mq_tags *tags, unsigned int tag)
 {
 	struct blk_mq_ctx *ctx = data->ctx;
 	struct blk_mq_hw_ctx *hctx = data->hctx;
@@ -365,14 +381,7 @@  static struct request *blk_mq_rq_ctx_init(struct blk_mq_alloc_data *data,
 	}
 	rq->timeout = 0;
 
-	if (blk_mq_need_time_stamp(rq))
-		rq->start_time_ns = ktime_get_ns();
-	else
-		rq->start_time_ns = 0;
 	rq->part = NULL;
-#ifdef CONFIG_BLK_RQ_ALLOC_TIME
-	rq->alloc_time_ns = alloc_time_ns;
-#endif
 	rq->io_start_time_ns = 0;
 	rq->stats_sectors = 0;
 	rq->nr_phys_segments = 0;
@@ -402,8 +411,7 @@  static struct request *blk_mq_rq_ctx_init(struct blk_mq_alloc_data *data,
 }
 
 static inline struct request *
-__blk_mq_alloc_requests_batch(struct blk_mq_alloc_data *data,
-		u64 alloc_time_ns)
+__blk_mq_alloc_requests_batch(struct blk_mq_alloc_data *data)
 {
 	unsigned int tag, tag_offset;
 	struct blk_mq_tags *tags;
@@ -422,7 +430,7 @@  __blk_mq_alloc_requests_batch(struct blk_mq_alloc_data *data,
 		tag = tag_offset + i;
 		prefetch(tags->static_rqs[tag]);
 		tag_mask &= ~(1UL << i);
-		rq = blk_mq_rq_ctx_init(data, tags, tag, alloc_time_ns);
+		rq = blk_mq_rq_ctx_init(data, tags, tag);
 		rq_list_add(data->cached_rq, rq);
 		nr++;
 	}
@@ -483,9 +491,11 @@  static struct request *__blk_mq_alloc_requests(struct blk_mq_alloc_data *data)
 	 * Try batched alloc if we want more than 1 tag.
 	 */
 	if (data->nr_tags > 1) {
-		rq = __blk_mq_alloc_requests_batch(data, alloc_time_ns);
-		if (rq)
+		rq = __blk_mq_alloc_requests_batch(data);
+		if (rq) {
+			blk_mq_rq_time_init(rq, alloc_time_ns);
 			return rq;
+		}
 		data->nr_tags = 1;
 	}
 
@@ -508,8 +518,9 @@  static struct request *__blk_mq_alloc_requests(struct blk_mq_alloc_data *data)
 		goto retry;
 	}
 
-	return blk_mq_rq_ctx_init(data, blk_mq_tags_from_data(data), tag,
-					alloc_time_ns);
+	rq = blk_mq_rq_ctx_init(data, blk_mq_tags_from_data(data), tag);
+	blk_mq_rq_time_init(rq, alloc_time_ns);
+	return rq;
 }
 
 static struct request *blk_mq_rq_cache_fill(struct request_queue *q,
@@ -564,6 +575,7 @@  static struct request *blk_mq_alloc_cached_request(struct request_queue *q,
 			return NULL;
 
 		plug->cached_rq = rq_list_next(rq);
+		blk_mq_rq_time_init(rq, 0);
 	}
 
 	rq->cmd_flags = opf;
@@ -665,8 +677,8 @@  struct request *blk_mq_alloc_request_hctx(struct request_queue *q,
 	tag = blk_mq_get_tag(&data);
 	if (tag == BLK_MQ_NO_TAG)
 		goto out_queue_exit;
-	rq = blk_mq_rq_ctx_init(&data, blk_mq_tags_from_data(&data), tag,
-					alloc_time_ns);
+	rq = blk_mq_rq_ctx_init(&data, blk_mq_tags_from_data(&data), tag);
+	blk_mq_rq_time_init(rq, alloc_time_ns);
 	rq->__data_len = 0;
 	rq->__sector = (sector_t) -1;
 	rq->bio = rq->biotail = NULL;
@@ -2901,6 +2913,7 @@  static inline struct request *blk_mq_get_cached_request(struct request_queue *q,
 	plug->cached_rq = rq_list_next(rq);
 	rq_qos_throttle(q, *bio);
 
+	blk_mq_rq_time_init(rq, 0);
 	rq->cmd_flags = (*bio)->bi_opf;
 	INIT_LIST_HEAD(&rq->queuelist);
 	return rq;