
[-next,RFC,0/6] improve large random io for HDD

Message ID 20220329094048.2107094-1-yukuai3@huawei.com

Message

Yu Kuai March 29, 2022, 9:40 a.m. UTC
Compared to blk-sq, blk-mq has a defect: split IOs end up discontinuous
if the device is under high IO pressure, while in blk-sq split IOs stay
contiguous. This is because:

1) split bios are issued one by one, and if one bio can't get a tag, it
will go to wait. - patch 2
2) each time 8 (or wake_batch) requests complete, 8 waiters are woken
up, so a woken thread is unlikely to get multiple tags (see the sketch
below). - patch 3,4
3) new IO can preempt tags even if there are lots of threads waiting for
tags. - patch 5
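
To make point 2 concrete, below is a toy userspace model (illustration
only, not kernel code; the constants are made up) of how one-tag-per-waiter
wakeups interleave the split bios of different jobs, versus waking only as
many waiters as the freed tags can fully satisfy:

#include <stdio.h>

#define WAKE_BATCH  8   /* tags freed by one completion batch */
#define NR_WAITERS  8   /* waiters woken by that batch today */
#define TAGS_PER_IO 4   /* e.g. a 512k IO split into 4 x 128k bios */

int main(void)
{
	/* Today: each woken waiter gets a single tag, so dispatch
	 * round-robins across jobs and adjacent bios of one job are
	 * separated on disk. */
	printf("one tag per waiter:  ");
	for (int t = 0; t < WAKE_BATCH; t++)
		printf("job%d/bio0 ", t % NR_WAITERS);
	printf("...\n");

	/* Patch 3/4 idea: wake WAKE_BATCH / TAGS_PER_IO waiters, so
	 * each woken job submits all of its split bios back to back. */
	printf("wake by needed tags: ");
	for (int j = 0; j < WAKE_BATCH / TAGS_PER_IO; j++)
		for (int b = 0; b < TAGS_PER_IO; b++)
			printf("job%d/bio%d ", j, b);
	printf("\n");
	return 0;
}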

Test environment:
x86 vm, nr_requests is set to 64, queue_depth is set to 32 and
max_sectors_kb is set to 128.

I haven't tested this patchset on a physical machine yet; I'll try later
if anyone thinks this approach is meaningful.

Fio job file:
[global]
filename=/dev/sda
ioengine=libaio
direct=1
offset_increment=100m

[test]
rw=randwrite
bs=512k
numjobs=256
iodepth=2
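
With these settings, each 512k IO is split into 512k / 128k = 4 bios, and
256 jobs * iodepth 2 = 512 in-flight IOs (up to 2048 split bios) compete
for nr_requests = 64 tags, so the tag pool is heavily oversubscribed.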

Result: ratio of sequential IO (calculated from blktrace logs)
original:
21%
patched (split IO thoroughly, wake up based on required tags):
40%
patched + flag set (tag preemption disabled):
69%

Yu Kuai (6):
  blk-mq: add a new flag 'BLK_MQ_F_NO_TAG_PREEMPTION'
  block: refactor to split bio thoroughly
  blk-mq: record how many tags are needed for split bio
  sbitmap: wake up the number of threads based on required tags
  blk-mq: don't preempt tag except for split bios
  sbitmap: force tag preemption if free tags are sufficient

 block/bio.c               |  2 +
 block/blk-merge.c         | 95 ++++++++++++++++++++++++++++-----------
 block/blk-mq-debugfs.c    |  1 +
 block/blk-mq-tag.c        | 39 +++++++++++-----
 block/blk-mq.c            | 14 +++++-
 block/blk-mq.h            |  7 +++
 block/blk.h               |  3 +-
 include/linux/blk-mq.h    |  7 ++-
 include/linux/blk_types.h |  6 +++
 include/linux/sbitmap.h   |  8 ++++
 lib/sbitmap.c             | 33 +++++++++++++-
 11 files changed, 173 insertions(+), 42 deletions(-)
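
To illustrate the intended preemption policy (patches 5/6), a minimal
sketch in plain C; the names below (tag_pool, may_preempt_tag) are made
up for illustration and are not the actual patch code:

struct tag_pool {
	int free_tags;   /* tags currently available */
	int nr_waiters;  /* threads queued waiting for a tag */
};

/* A thread may take a tag ahead of existing waiters only when it is
 * finishing a string of split bios, or when free tags outnumber the
 * waiters so preemption cannot starve anyone. */
static int may_preempt_tag(const struct tag_pool *pool, int continuing_splits)
{
	if (continuing_splits)
		return 1;	/* keep the split bios contiguous */
	if (pool->free_tags >= pool->nr_waiters)
		return 1;	/* tags are plentiful, preemption is harmless */
	return 0;		/* otherwise queue up behind the waiters */
}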

Comments

Jens Axboe March 29, 2022, 12:53 p.m. UTC | #1
On 3/29/22 3:40 AM, Yu Kuai wrote:
> Compared to blk-sq, blk-mq has a defect: split IOs end up discontinuous
> if the device is under high IO pressure, while in blk-sq split IOs stay
> contiguous. This is because:
> 
> 1) split bios are issued one by one, and if one bio can't get a tag, it
> will go to wait. - patch 2
> 2) each time 8 (or wake_batch) requests complete, 8 waiters are woken
> up, so a woken thread is unlikely to get multiple tags. - patch 3,4
> 3) new IO can preempt tags even if there are lots of threads waiting for
> tags. - patch 5
> 
> Test environment:
> x86 vm, nr_requests is set to 64, queue_depth is set to 32 and
> max_sectors_kb is set to 128.
> 
> I haven't tested this patchset on a physical machine yet; I'll try later
> if anyone thinks this approach is meaningful.

A real machine test would definitely be a requirement. What real-world
use cases is this solving? These days most devices have plenty of tags,
and I would not really expect tag starvation to be much of a concern.

However, I do think there's merit in fixing the unfairness we have
here. But not at the cost of all of this. Why not just simply enforce
more strict ordering of tag allocations? If someone is waiting, you get
to wait too.

And I don't see much utility at all in tracking how many splits (and
hence tags) would be required. Is this really a common issue, tons of
splits and needing many tags? Why not just enforce the strict ordering
as mentioned above, not allowing new allocators to get a tag if others
are waiting, but perhaps allow someone submitting a string of splits to
indeed keep allocating.

Yes, it'll be less efficient to still wake one-by-one, but honestly do
we really care about that? If you're stalled on waiting for other IO to
finish and release a tag, that isn't very efficient to begin with and
doesn't seem like a case worth optimizing for, to me.
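
For comparison, the stricter ordering suggested above might look like
this (again with made-up names, reusing the toy tag_pool from the cover
letter; take_free_tag() stands in for the real allocator):

/* If anyone is already waiting, a new allocator waits too; only a
 * thread in the middle of a string of splits may keep allocating. */
static int get_tag_strict(struct tag_pool *pool, int continuing_splits)
{
	if (pool->nr_waiters > 0 && !continuing_splits)
		return -1;		/* join the waitqueue instead */
	return take_free_tag(pool);	/* hypothetical: pop a free tag */
}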
Yu Kuai March 30, 2022, 2:05 a.m. UTC | #2
On 2022/03/29 20:53, Jens Axboe wrote:
> On 3/29/22 3:40 AM, Yu Kuai wrote:
>> Compared to blk-sq, blk-mq has a defect: split IOs end up discontinuous
>> if the device is under high IO pressure, while in blk-sq split IOs stay
>> contiguous. This is because:
>>
>> 1) split bios are issued one by one, and if one bio can't get a tag, it
>> will go to wait. - patch 2
>> 2) each time 8 (or wake_batch) requests complete, 8 waiters are woken
>> up, so a woken thread is unlikely to get multiple tags. - patch 3,4
>> 3) new IO can preempt tags even if there are lots of threads waiting for
>> tags. - patch 5
>>
>> Test environment:
>> x86 vm, nr_requests is set to 64, queue_depth is set to 32 and
>> max_sectors_kb is set to 128.
>>
>> I haven't tested this patchset on a physical machine yet; I'll try later
>> if anyone thinks this approach is meaningful.
> 
> A real machine test would definitely be a requirement. What real-world
> use cases is this solving? These days most devices have plenty of tags,
> and I would not really expect tag starvation to be much of a concern.
> 
> However, I do think there's merit in fixing the unfairness we have
> here. But not at the cost of all of this. Why not just simply enforce
> more strict ordering of tag allocations? If someone is waiting, you get
> to wait too.
> 
> And I don't see much utility at all in tracking how many splits (and
> hence tags) would be required. Is this really a common issue, tons of
> splits and needing many tags? Why not just enforce the strict ordering
> as mentioned above, not allowing new allocators to get a tag if others
> are waiting, but perhaps allow someone submitting a string of splits to
> indeed keep allocating.
> 
> Yes, it'll be less efficient to still wake one-by-one, but honestly do
> we really care about that? If you're stalled on waiting for other IO to
> finish and release a tag, that isn't very efficient to begin with and
> doesn't seem like a case worth optimizing for, to me.
> 

Hi,

Thanks for your advice, I'll do more work based on your suggestions.

Kuai