diff mbox series

[v3] block: move non sync requests complete flow to softirq

Message ID 20240903115437.42307-1-zhanghui31@xiaomi.com (mailing list archive)
State New
Headers show
Series [v3] block: move non sync requests complete flow to softirq | expand

Commit Message

章辉 Sept. 3, 2024, 11:54 a.m. UTC
From: zhanghui <zhanghui31@xiaomi.com>

Currently, for a controller that supports multiple queues, like UFS 4.0,
mq_ops->complete is executed in the interrupt top-half. Therefore,
the file system's end_io, such as f2fs_write_end_io on smartphones,
is executed during the request completion process.

However, we found that the execution time of the file system end_io
is strongly related to the size of the bio and the processing speed
of the CPU, because the file system's end_io traverses every page
in the bio, which is a very time-consuming operation.

We measured that an 80M bio write operation on a little CPU can
cause the execution time of the top-half to exceed 100ms.
The CPU tick on a smartphone is only 4ms, which will undoubtedly affect
scheduling efficiency.

Fix this issue by moving the non-sync request completion flow to
softirq, while keeping sync request completion in the top-half.

Signed-off-by: zhanghui <zhanghui31@xiaomi.com>
---
 block/blk-mq.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

Comments

Bart Van Assche Sept. 3, 2024, 5:28 p.m. UTC | #1
On 9/3/24 4:54 AM, ZhangHui wrote:
> Currently, for a controller that supports multiple queues, like UFS 4.0,
> mq_ops->complete is executed in the interrupt top-half. Therefore,
> the file system's end_io, such as f2fs_write_end_io on smartphones,
> is executed during the request completion process.
> 
> However, we found that the execution time of the file system end_io
> is strongly related to the size of the bio and the processing speed
> of the CPU, because the file system's end_io traverses every page
> in the bio, which is a very time-consuming operation.
> 
> We measured that an 80M bio write operation on a little CPU can
> cause the execution time of the top-half to exceed 100ms.
> The CPU tick on a smartphone is only 4ms, which will undoubtedly affect
> scheduling efficiency.
> 
> Fix this issue by moving the non-sync request completion flow to
> softirq, while keeping sync request completion in the top-half.

An explanation is missing from the patch description why this issue
cannot be solved by changing rq_affinity to 2.

Thanks,

Bart.
Jens Axboe Sept. 3, 2024, 5:47 p.m. UTC | #2
On 9/3/24 11:28 AM, Bart Van Assche wrote:
> On 9/3/24 4:54 AM, ZhangHui wrote:
>> Currently, for a controller that supports multiple queues, like UFS 4.0,
>> mq_ops->complete is executed in the interrupt top-half. Therefore,
>> the file system's end_io, such as f2fs_write_end_io on smartphones,
>> is executed during the request completion process.
>>
>> However, we found that the execution time of the file system end_io
>> is strongly related to the size of the bio and the processing speed
>> of the CPU, because the file system's end_io traverses every page
>> in the bio, which is a very time-consuming operation.
>>
>> We measured that an 80M bio write operation on a little CPU can
>> cause the execution time of the top-half to exceed 100ms.
>> The CPU tick on a smartphone is only 4ms, which will undoubtedly affect
>> scheduling efficiency.
>>
>> Fix this issue by moving the non-sync request completion flow to
>> softirq, while keeping sync request completion in the top-half.
> 
> An explanation is missing from the patch description why this issue
> cannot be solved by changing rq_affinity to 2.

And what's also missing is a changelog - to the poster, always include
what's changed since the last version posted. Otherwise you just have
3 random patches posted and leave the discovery of why on earth there's
now a v3 to the reader in having to pull in all 3 versions and see if
the progression made sense.
Jens Axboe Sept. 3, 2024, 5:49 p.m. UTC | #3
On 9/3/24 5:54 AM, ZhangHui wrote:
> From: zhanghui <zhanghui31@xiaomi.com>
> 
> Currently, for a controller that supports multiple queues, like UFS 4.0,
> mq_ops->complete is executed in the interrupt top-half. Therefore,
> the file system's end_io, such as f2fs_write_end_io on smartphones,
> is executed during the request completion process.
> 
> However, we found that the execution time of the file system end_io
> is strongly related to the size of the bio and the processing speed
> of the CPU, because the file system's end_io traverses every page
> in the bio, which is a very time-consuming operation.
> 
> We measured that an 80M bio write operation on a little CPU can
> cause the execution time of the top-half to exceed 100ms.
> The CPU tick on a smartphone is only 4ms, which will undoubtedly affect
> scheduling efficiency.

The elephant in the room here is why an 80M completion takes 100 msec?
That seems... insane.

That aside, doing writes that big isn't great for latencies in general,
even if they are orders of magnitude smaller (as they should be). Maybe
this is solvable by just limiting the write size here.

But it really seems out of line for a write that size to take 100 msec
to process.
章辉 Sept. 4, 2024, 3:25 a.m. UTC | #4
On 2024/9/4 1:49, Jens Axboe wrote:
> On 9/3/24 5:54 AM, ZhangHui wrote:
>> From: zhanghui <zhanghui31@xiaomi.com>
>>
>> Currently, for a controller that supports multiple queues, like UFS 4.0,
>> mq_ops->complete is executed in the interrupt top-half. Therefore,
>> the file system's end_io, such as f2fs_write_end_io on smartphones,
>> is executed during the request completion process.
>>
>> However, we found that the execution time of the file system end_io
>> is strongly related to the size of the bio and the processing speed
>> of the CPU, because the file system's end_io traverses every page
>> in the bio, which is a very time-consuming operation.
>>
>> We measured that an 80M bio write operation on a little CPU can
>> cause the execution time of the top-half to exceed 100ms.
>> The CPU tick on a smartphone is only 4ms, which will undoubtedly affect
>> scheduling efficiency.
> The elephant in the room here is why an 80M completion takes 100 msec?
> That seems... insane.
>
> That aside, doing writes that big isn't great for latencies in general,
> even if they are orders of magnitude smaller (as they should be). Maybe
> this is solvable by just limiting the write size here.
>
> But it really seems out of line for a write that size to take 100 msec
> to process.
>
> --
> Jens Axboe
>
Hi Jens,

This problem is strongly related to whether the CPU is a big core or
a little core, and to the CPU frequency. On a big core, the time will
obviously be shorter, but we cannot assume which core the IO will be
completed on, nor the current CPU frequency.

Limiting the IO size is also an option, but how large the limit should
be is a problem, and I am worried it might cause bandwidth loss in
some scenarios.

Thanks
Zhang
章辉 Sept. 4, 2024, 3:35 a.m. UTC | #5
On 2024/9/4 1:28, Bart Van Assche wrote:
> On 9/3/24 4:54 AM, ZhangHui wrote:
>> Currently, for a controller that supports multiple queues, like UFS 4.0,
>> mq_ops->complete is executed in the interrupt top-half. Therefore,
>> the file system's end_io, such as f2fs_write_end_io on smartphones,
>> is executed during the request completion process.
>>
>> However, we found that the execution time of the file system end_io
>> is strongly related to the size of the bio and the processing speed
>> of the CPU, because the file system's end_io traverses every page
>> in the bio, which is a very time-consuming operation.
>>
>> We measured that an 80M bio write operation on a little CPU can
>> cause the execution time of the top-half to exceed 100ms.
>> The CPU tick on a smartphone is only 4ms, which will undoubtedly affect
>> scheduling efficiency.
>>
>> Fix this issue by moving the non-sync request completion flow to
>> softirq, while keeping sync request completion in the top-half.
>
> An explanation is missing from the patch description why this issue
> cannot be solved by changing rq_affinity to 2.
>
> Thanks,
>
> Bart.
>
hi Bart,

Does setting rq_affinity to 2 mean QUEUE_FLAG_SAME_COMP?

This seems to determine on which core the current request is
completed, rather than whether it runs in the interrupt top half or
bottom half?

Thanks
Zhang
Christoph Hellwig Sept. 4, 2024, 4:29 a.m. UTC | #6
On Tue, Sep 03, 2024 at 11:49:28AM -0600, Jens Axboe wrote:
> The elephant in the room here is why an 80M completion takes 100 msec?
> That seems... insane.
> 
> That aside, doing writes that big isn't great for latencies in general,
> even if they are orders of magnitude smaller (as they should be). Maybe
> this is solvable by just limiting the write size here.
> 
> But it really seems out of line for a write that size to take 100 msec
> to process.

pagecache state processing is quite inefficient; we had to limit the
number of pages per completion for XFS to avoid latency problems a while ago.
Note that moving to folios means you can process a lot more data
with the same number of completion iterations as well.  I'd suggest
the submitter looks into that for whatever file system they are using.
Ming Lei Sept. 4, 2024, 8:01 a.m. UTC | #7
On Tue, Sep 03, 2024 at 07:54:37PM +0800, ZhangHui wrote:
> From: zhanghui <zhanghui31@xiaomi.com>
> 
> Currently, for a controller that supports multiple queues, like UFS 4.0,
> mq_ops->complete is executed in the interrupt top-half. Therefore,
> the file system's end_io, such as f2fs_write_end_io on smartphones,
> is executed during the request completion process.
> 
> However, we found that the execution time of the file system end_io
> is strongly related to the size of the bio and the processing speed
> of the CPU, because the file system's end_io traverses every page
> in the bio, which is a very time-consuming operation.
> 
> We measured that an 80M bio write operation on a little CPU can

What is an 80M bio?

It is a known issue that soft lockups may be triggered in case of N:M
blk-mq mapping, but I am not sure if that is the case here.

What is nr_hw_queues(blk_mq) and nr_cpus in your system?

> cause the execution time of the top-half to exceed 100ms.
> The CPU tick on a smartphone is only 4ms, which will undoubtedly affect
> scheduling efficiency.

Scheduling is off in softirq (the bottom half) too.

> 
> Fix this issue by moving the non-sync request completion flow to
> softirq, while keeping sync request completion in the top-half.

If you care about interrupt-off or schedule-off latency, you may have to
move the IO handling into thread context in the driver.

BTW, a threaded irq can't help you either.


Thanks, 
Ming
Bart Van Assche Sept. 4, 2024, 5:22 p.m. UTC | #8
On 9/3/24 8:35 PM, 章辉 wrote:
> Does set rq_affinity to 2 mean QUEUE_FLAG_SAME_COMP?

From block/blk-sysfs.c:

	if (val == 2) {
		blk_queue_flag_set(QUEUE_FLAG_SAME_COMP, q);
		blk_queue_flag_set(QUEUE_FLAG_SAME_FORCE, q);
	}

> This seems to determine on which core the current request is
> completed, rather than in the interrupt top or bottom half?

That's correct. I suggested this because I was wondering whether
spreading the I/O completion workload over more CPU cores would help?

Thanks,

Bart.
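For anyone wanting to try Bart's suggestion, rq_affinity is the per-queue sysfs knob behind those flags (the device name below is an example; substitute your actual block device):

```shell
# rq_affinity: 0 = complete on any CPU,
# 1 = complete on a CPU in the submitter's group (QUEUE_FLAG_SAME_COMP),
# 2 = force completion on the submitting CPU (adds QUEUE_FLAG_SAME_FORCE)
echo 2 > /sys/block/sda/queue/rq_affinity
cat /sys/block/sda/queue/rq_affinity
```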
章辉 Sept. 5, 2024, 2:46 a.m. UTC | #9
On 2024/9/4 16:01, Ming Lei wrote:
> On Tue, Sep 03, 2024 at 07:54:37PM +0800, ZhangHui wrote:
>> From: zhanghui <zhanghui31@xiaomi.com>
>>
>> Currently, for a controller that supports multiple queues, like UFS 4.0,
>> mq_ops->complete is executed in the interrupt top-half. Therefore,
>> the file system's end_io, such as f2fs_write_end_io on smartphones,
>> is executed during the request completion process.
>>
>> However, we found that the execution time of the file system end_io
>> is strongly related to the size of the bio and the processing speed
>> of the CPU, because the file system's end_io traverses every page
>> in the bio, which is a very time-consuming operation.
>>
>> We measured that an 80M bio write operation on a little CPU can
> What is an 80M bio?
>
> It is a known issue that soft lockups may be triggered in case of N:M
> blk-mq mapping, but I am not sure if that is the case here.
>
> What is nr_hw_queues(blk_mq) and nr_cpus in your system?
>
>> cause the execution time of the top-half to exceed 100ms.
>> The CPU tick on a smartphone is only 4ms, which will undoubtedly affect
>> scheduling efficiency.
> Scheduling is off in softirq (the bottom half) too.
>
>> Fix this issue by moving the non-sync request completion flow to
>> softirq, while keeping sync request completion in the top-half.
> If you care about interrupt-off or schedule-off latency, you may have to
> move the IO handling into thread context in the driver.
>
> BTW, a threaded irq can't help you either.
>
>
> Thanks,
> Ming
>
Hi Ming,

Very good reminder, thank you.

On smartphones, nr_hw_queues and nr_cpus are 1:1. I am more concerned
about the interrupt-off latency, which is more obvious on little cores.

Moving time-consuming work to the bottom half may not help with
scheduling latency, but it may help the interrupt response latency of
other modules in the system.

Thanks
Zhang
章辉 Sept. 5, 2024, 2:52 a.m. UTC | #10
On 2024/9/4 12:29, Christoph Hellwig wrote:
> On Tue, Sep 03, 2024 at 11:49:28AM -0600, Jens Axboe wrote:
>> The elephant in the room here is why an 80M completion takes 100 msec?
>> That seems... insane.
>>
>> That aside, doing writes that big isn't great for latencies in general,
>> even if they are orders of magnitude smaller (as they should be). Maybe
>> this is solvable by just limiting the write size here.
>>
>> But it really seems out of line for a write that size to take 100 msec
>> to process.
> pagecache state processing is quite inefficient, we had to limit
> the number of them for XFS to avoid latency problems a while ago.
> Note that moving to folios means you can process a lot more data
> with the same number of completion iterations as well.  I'd suggest
> the submitter looks into that for whatever file system they are using.
>
Hi Christoph,

The smartphone uses the F2FS file system, and its end_io traverses
pages rather than folios.
I will look into migrating it to folios. Thank you!

Thanks
Zhang
章辉 Sept. 5, 2024, 3:05 a.m. UTC | #11
On 2024/9/5 1:22, Bart Van Assche wrote:
> On 9/3/24 8:35 PM, 章辉 wrote:
>> Does set rq_affinity to 2 mean QUEUE_FLAG_SAME_COMP?
>
> From block/blk-sysfs.c:
>
>        if (val == 2) {
>                blk_queue_flag_set(QUEUE_FLAG_SAME_COMP, q);
>                blk_queue_flag_set(QUEUE_FLAG_SAME_FORCE, q);
>        }
>
>> This seems to determine on which core the current request is
>> completed, rather than in the interrupt top or bottom half?
>
> That's correct. I suggested this because I was wondering whether
> spreading the I/O completion workload over more CPU cores would help?
>
> Thanks,
>
> Bart.
>
Hi Bart,

Increasing the CPU frequency or binding the interrupt to a big core
will certainly help the end_io processing delay, depending on the
system's requirements for IO response latency.

But that does not conflict with this change, which moves the
time-consuming operation to the bottom half.


Thanks
Zhang
Ming Lei Sept. 5, 2024, 3:49 a.m. UTC | #12
On Thu, Sep 05, 2024 at 02:46:39AM +0000, 章辉 wrote:
> On 2024/9/4 16:01, Ming Lei wrote:
> > On Tue, Sep 03, 2024 at 07:54:37PM +0800, ZhangHui wrote:
> >> From: zhanghui <zhanghui31@xiaomi.com>
> >>
> >> Currently, for a controller that supports multiple queues, like UFS 4.0,
> >> mq_ops->complete is executed in the interrupt top-half. Therefore,
> >> the file system's end_io, such as f2fs_write_end_io on smartphones,
> >> is executed during the request completion process.
> >>
> >> However, we found that the execution time of the file system end_io
> >> is strongly related to the size of the bio and the processing speed
> >> of the CPU, because the file system's end_io traverses every page
> >> in the bio, which is a very time-consuming operation.
> >>
> >> We measured that an 80M bio write operation on a little CPU can
> > What is an 80M bio?
> >
> > It is a known issue that soft lockups may be triggered in case of N:M
> > blk-mq mapping, but I am not sure if that is the case here.
> >
> > What is nr_hw_queues(blk_mq) and nr_cpus in your system?
> >
> >> cause the execution time of the top-half to exceed 100ms.
> >> The CPU tick on a smartphone is only 4ms, which will undoubtedly affect
> >> scheduling efficiency.
> > Scheduling is off in softirq (the bottom half) too.
> >
> >> Fix this issue by moving the non-sync request completion flow to
> >> softirq, while keeping sync request completion in the top-half.
> > If you care about interrupt-off or schedule-off latency, you may have to
> > move the IO handling into thread context in the driver.
> >
> > BTW, a threaded irq can't help you either.
> >
> >
> > Thanks,
> > Ming
> >
> Hi Ming,
> 
> Very good reminder, thank you.
> 
> On smartphones, nr_hw_queues and nr_cpus are 1:1. I am more concerned
> about the interrupt-off latency, which is more obvious on little cores.

So you submit 80M bytes from one CPU, and almost all these bios are completed
in a single interrupt, which looks very unlikely, unless your
UFS controller is far faster than the CPU.

> 
> Moving time-consuming work to the bottom half may not help with
> scheduling latency, but it may help the interrupt response latency of
> other modules in the system.

Scheduling response latency is system-wide too.

Then please document the interrupt latency improvement in your commit
log instead of the scheduling one; otherwise it is just misleading.

```
The CPU tick on a smartphone is only 4ms, which will undoubtedly affect
scheduling efficiency.
```

Thanks,
Ming
章辉 Sept. 5, 2024, 7:33 a.m. UTC | #13
On 2024/9/5 11:49, Ming Lei wrote:
> On Thu, Sep 05, 2024 at 02:46:39AM +0000, 章辉 wrote:
>> On 2024/9/4 16:01, Ming Lei wrote:
>>> On Tue, Sep 03, 2024 at 07:54:37PM +0800, ZhangHui wrote:
>>>> From: zhanghui <zhanghui31@xiaomi.com>
>>>>
>>>> Currently, for a controller that supports multiple queues, like UFS 4.0,
>>>> mq_ops->complete is executed in the interrupt top-half. Therefore,
>>>> the file system's end_io, such as f2fs_write_end_io on smartphones,
>>>> is executed during the request completion process.
>>>>
>>>> However, we found that the execution time of the file system end_io
>>>> is strongly related to the size of the bio and the processing speed
>>>> of the CPU, because the file system's end_io traverses every page
>>>> in the bio, which is a very time-consuming operation.
>>>>
>>>> We measured that an 80M bio write operation on a little CPU can
>>> What is an 80M bio?
>>>
>>> It is a known issue that soft lockups may be triggered in case of N:M
>>> blk-mq mapping, but I am not sure if that is the case here.
>>>
>>> What is nr_hw_queues(blk_mq) and nr_cpus in your system?
>>>
>>>> cause the execution time of the top-half to exceed 100ms.
>>>> The CPU tick on a smartphone is only 4ms, which will undoubtedly affect
>>>> scheduling efficiency.
>>> Scheduling is off in softirq (the bottom half) too.
>>>
>>>> Fix this issue by moving the non-sync request completion flow to
>>>> softirq, while keeping sync request completion in the top-half.
>>> If you care about interrupt-off or schedule-off latency, you may have to
>>> move the IO handling into thread context in the driver.
>>>
>>> BTW, a threaded irq can't help you either.
>>>
>>>
>>> Thanks,
>>> Ming
>>>
>> Hi Ming,
>>
>> Very good reminder, thank you.
>>
>> On smartphones, nr_hw_queues and nr_cpus are 1:1. I am more concerned
>> about the interrupt-off latency, which is more obvious on little cores.
> So you submit 80M bytes from one CPU, and almost all these bios are completed
> in a single interrupt, which looks very unlikely, unless your
> UFS controller is far faster than the CPU.

The 80M bio refers to the bio sent by the file system. At the block
layer, it is split into many bios that form a bio chain. The
time-consuming part is the file system's end_io processing all of the
page state, and it is only actually called after all the bios in the
80M chain have completed.

>> Moving time-consuming work to the bottom half may not help with
>> scheduling latency, but it may help the interrupt response latency of
>> other modules in the system.
> Scheduling response latency is system-wide too.
>
> Then please document the interrupt latency improvement in your commit
> log instead of the scheduling one; otherwise it is just misleading.
>
> ```
> The CPU tick on a smartphone is only 4ms, which will undoubtedly affect
> scheduling efficiency.
> ```
>
> Thanks,
> Ming

Hi Ming,

OK, I will post a v4 later with the commit message updated.
Thank you for your suggestion!
Besides this, do you have any other concerns?

Thanks
Zhang

Patch

diff --git a/block/blk-mq.c b/block/blk-mq.c
index e3c3c0c21b55..06b232edff11 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1193,6 +1193,8 @@  static void blk_mq_raise_softirq(struct request *rq)
 
 bool blk_mq_complete_request_remote(struct request *rq)
 {
+	const bool is_sync = rq_is_sync(rq);
+
 	WRITE_ONCE(rq->state, MQ_RQ_COMPLETE);
 
 	/*
@@ -1210,7 +1212,7 @@  bool blk_mq_complete_request_remote(struct request *rq)
 		return true;
 	}
 
-	if (rq->q->nr_hw_queues == 1) {
+	if ((rq->q->nr_hw_queues == 1) || !is_sync) {
 		blk_mq_raise_softirq(rq);
 		return true;
 	}