[-next,RFC,v3,0/8] improve tag allocation under heavy load

Message ID: 20220415101053.554495-1-yukuai3@huawei.com

Message

Yu Kuai April 15, 2022, 10:10 a.m. UTC
Changes in v3:
 - update 'waiters_cnt' before 'ws_active' in sbitmap_prepare_to_wait()
 in patch 1, in case __sbq_wake_up() sees 'ws_active > 0' while
 'waiters_cnt' are all 0, which would cause a dead loop.
 - don't add 'wait_index' during each loop in patch 2.
 - fix that 'wake_index' might mismatch on the first wakeup in patch 3,
 and improve the coding of the patch.
 - add detection in patch 4 in case an io hang is triggered in corner
 cases.
 - make the detection of whether free tags are sufficient more flexible.
 - fix a race in patch 8.
 - fix some wording and add some comments.

Changes in v2:
 - use a new title
 - add patches to fix waitqueues' unfairness - patch 1-3
 - delete the patch to add a queue flag
 - delete the patch to split big io thoroughly

In this patchset:
 - patch 1-3 fix waitqueues' unfairness.
 - patch 4,5 disable tag preemption under heavy load.
 - patch 6 forces tag preemption for split bios.
 - patch 7,8 improve large random io for HDD. We do meet this problem and
 I'm trying to fix it at very low cost. However, if anyone still thinks
 this is not a common case and not worth optimizing, I'll drop them.

Compared to blk-sq, blk-mq has a defect: split io will end up
discontinuous if the device is under high io pressure, while split io is
still continuous in sq. This is because:

1) new io can preempt a tag even if there are lots of threads waiting.
2) split bios are issued one by one; if one bio can't get a tag, it will
go to wait.
3) each time 8 (or wake_batch) requests are done, 8 waiters will be woken
up. Thus if a thread is woken up, it is unlikely to get multiple tags.

A simplified model of this interleaving is sketched right below.
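To make points 1)-3) concrete, here is a small self-contained userspace
model (purely illustrative, not blk-mq code; the names, the counts and
the semaphore pool are all invented): each job's 1m "bio" needs one tag
per 128k fragment, taken one at a time from a shared pool, and nothing
keeps one job's fragments together, so the printed dispatch order
interleaves across jobs:

/*
 * Illustrative model only, not kernel code: every fragment of a split
 * takes its own tag, so fragments from different jobs interleave in
 * dispatch order, i.e. the on-disk io becomes discontinuous.
 */
#include <pthread.h>
#include <semaphore.h>
#include <stdio.h>

#define NR_TAGS  4	/* small pool to force contention */
#define NR_JOBS  8
#define NR_FRAGS 8	/* 1m / 128k */

static sem_t tag_pool;

static void *submit_job(void *arg)
{
	long job = (long)arg;

	for (int frag = 0; frag < NR_FRAGS; frag++) {
		sem_wait(&tag_pool);	/* get_tag(): may block per fragment */
		printf("job %ld frag %d\n", job, frag);	/* "dispatch" */
		sem_post(&tag_pool);	/* "completion" returns the tag */
	}
	return NULL;
}

int main(void)
{
	pthread_t threads[NR_JOBS];

	sem_init(&tag_pool, 0, NR_TAGS);
	for (long i = 0; i < NR_JOBS; i++)
		pthread_create(&threads[i], NULL, submit_job, (void *)i);
	for (int i = 0; i < NR_JOBS; i++)
		pthread_join(threads[i], NULL);
	return 0;
}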

The problem was first found when upgrading the kernel from v3.10 to
v4.18; the test device is an HDD with 'max_sectors_kb' of 256, and the
test case issues 1m ios with high concurrency.

Note that there is a precondition for this performance problem: there is
a certain gap between the bandwidth of a single io with
bs=max_sectors_kb and the disk's upper limit.

During the test, I found that waitqueues can be extremely unbalanced
under heavy load. This is because 'wake_index' is not set properly in
__sbq_wake_up(); see details in patch 3 and the simplified sketch below.
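For reference, here is a simplified rendering of the v5.18-era wakeup
path (paraphrased from lib/sbitmap.c and trimmed, so treat it as a
sketch rather than the exact source):

/*
 * Sketch of the v5.18-era logic, details trimmed. sbq_wake_ptr() scans
 * from wake_index for a waitqueue with waiters; __sbq_wake_up() later
 * advances wake_index. Because the queue that is actually woken and the
 * index that is advanced can disagree, wakeups can keep landing on the
 * same few waitqueues while others starve - the imbalance patch 3
 * addresses.
 */
static struct sbq_wait_state *sbq_wake_ptr(struct sbitmap_queue *sbq)
{
	int i, wake_index = atomic_read(&sbq->wake_index);

	for (i = 0; i < SBQ_WAIT_QUEUES; i++) {
		struct sbq_wait_state *ws = &sbq->ws[wake_index];

		if (waitqueue_active(&ws->wait)) {
			if (wake_index != atomic_read(&sbq->wake_index))
				atomic_set(&sbq->wake_index, wake_index);
			return ws;
		}
		wake_index = sbq_index_inc(wake_index);
	}
	return NULL;	/* no waiters */
}

static void __sbq_wake_up(struct sbitmap_queue *sbq)
{
	struct sbq_wait_state *ws = sbq_wake_ptr(sbq);

	if (ws && atomic_dec_return(&ws->wait_cnt) <= 0) {
		/* advances wake_index, not necessarily past 'ws' */
		sbq_index_atomic_inc(&sbq->wake_index);
		atomic_set(&ws->wait_cnt, READ_ONCE(sbq->wake_batch));
		wake_up_nr(&ws->wait, READ_ONCE(sbq->wake_batch));
	}
}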

Test environment:
arm64, 96 cores with 200 BogoMIPS; the test device is an HDD. The
default 'max_sectors_kb' is 1280 (sorry that I was unable to test on the
machine where 'max_sectors_kb' is 256).

The single io performance (randwrite):

| bs       | 128k | 256k | 512k | 1m   | 1280k | 2m   | 4m   |
| -------- | ---- | ---- | ---- | ---- | ----- | ---- | ---- |
| bw MiB/s | 20.1 | 33.4 | 51.8 | 67.1 | 74.7  | 82.9 | 82.9 |

It can be seen that 1280k io is already close to the upper limit, and it
would be hard to see differences with the default value, thus I set
'max_sectors_kb' to 128 in the following test.

Test cmd:
        fio \
        -filename=/dev/$dev \
        -name=test \
        -ioengine=psync \
        -allow_mounted_write=0 \
        -group_reporting \
        -direct=1 \
        -offset_increment=1g \
        -rw=randwrite \
        -bs=1024k \
        -numjobs={1,2,4,8,16,32,64,128,256,512} \
        -runtime=110 \
        -ramp_time=10

Test result: MiB/s

| numjobs | v5.18-rc1 | v5.18-rc1-patched |
| ------- | --------- | ----------------- |
| 1       | 67.7      | 67.7              |
| 2       | 67.7      | 67.7              |
| 4       | 67.7      | 67.7              |
| 8       | 67.7      | 67.7              |
| 16      | 64.8      | 65.6              |
| 32      | 59.8      | 63.8              |
| 64      | 54.9      | 59.4              |
| 128     | 49        | 56.9              |
| 256     | 37.7      | 58.3              |
| 512     | 31.8      | 57.9              |

Yu Kuai (8):
  sbitmap: record the number of waiters for each waitqueue
  blk-mq: call 'bt_wait_ptr()' later in blk_mq_get_tag()
  sbitmap: make sure waitqueues are balanced
  blk-mq: don't preempt tag under heavy load
  sbitmap: force tag preemption if free tags are sufficient
  blk-mq: force tag preemption for split bios
  blk-mq: record how many tags are needed for splited bio
  sbitmap: wake up the number of threads based on required tags

 block/blk-merge.c         |   8 +-
 block/blk-mq-tag.c        |  49 +++++++++----
 block/blk-mq.c            |  54 +++++++++++++-
 block/blk-mq.h            |   4 +
 include/linux/blk_types.h |   4 +
 include/linux/sbitmap.h   |   9 +++
 lib/sbitmap.c             | 149 +++++++++++++++++++++++++++-----------
 7 files changed, 216 insertions(+), 61 deletions(-)

Comments

Yu Kuai April 24, 2022, 2:43 a.m. UTC | #1
friendly ping ...

On 2022/04/15 18:10, Yu Kuai wrote:
> [...]
Bart Van Assche April 25, 2022, 3:09 a.m. UTC | #2
On 4/15/22 03:10, Yu Kuai wrote:
> The single io performance(randwrite):
> 
> | bs       | 128k | 256k | 512k | 1m   | 1280k | 2m   | 4m   |
> | -------- | ---- | ---- | ---- | ---- | ----- | ---- | ---- |
> | bw MiB/s | 20.1 | 33.4 | 51.8 | 67.1 | 74.7  | 82.9 | 82.9 |

Although the above data is interesting, it is not sufficient. The above 
data comes from a setup with a single hard disk. There are many other 
configurations that are relevant (hard disk array, high speed NVMe, QD=1 
USB stick, ...) but for which no conclusions can be drawn from the above 
data.

Another question is whether the approach of this patch series is the
right one. I would expect round-robin wakeup of waiters to be ideal from
a fairness point of view. However, there are patches in this series that
guarantee that wakeup of tag waiters won't happen in a round-robin
fashion.

Thanks,

Bart.
Damien Le Moal April 25, 2022, 3:24 a.m. UTC | #3
On 4/24/22 11:43, yukuai (C) wrote:
> friendly ping ...
> 
> On 2022/04/15 18:10, Yu Kuai wrote:
>> [...]
>> The single io performance(randwrite):
>>
>> | bs       | 128k | 256k | 512k | 1m   | 1280k | 2m   | 4m   |
>> | -------- | ---- | ---- | ---- | ---- | ----- | ---- | ---- |
>> | bw MiB/s | 20.1 | 33.4 | 51.8 | 67.1 | 74.7  | 82.9 | 82.9 |

These results are extremely strange, unless you are running with the
device write cache disabled? If you have the device write cache enabled,
the problem you mention above would most likely be completely invisible,
which I guess is why nobody really noticed any issue until now.

Similarly, with reads, the device-side read-ahead may hide the problem,
albeit that depends on how "intelligent" the drive is at identifying
sequential accesses.

>> [...]
>>
>> Test result: MiB/s
>>
>> | numjobs | v5.18-rc1 | v5.18-rc1-patched |
>> | ------- | --------- | ----------------- |
>> | 1       | 67.7      | 67.7              |
>> | 2       | 67.7      | 67.7              |
>> | 4       | 67.7      | 67.7              |
>> | 8       | 67.7      | 67.7              |
>> | 16      | 64.8      | 65.6              |
>> | 32      | 59.8      | 63.8              |
>> | 64      | 54.9      | 59.4              |
>> | 128     | 49        | 56.9              |
>> | 256     | 37.7      | 58.3              |
>> | 512     | 31.8      | 57.9              |

Device write cache disabled?

Also, what is the max QD of this disk?

E.g., if it is SATA, it is 32, so you will only get at most 64 scheduler
tags. So for any of your tests with more than 64 threads, many of the
threads will be waiting for a scheduler tag for the BIO before the
bio_split problem you explain triggers. Given that the numbers you show
are the same before and after the patch for a number of threads <= 64, I
am tempted to think that the problem is not really BIO splitting...

What about random read workloads? What kind of results do you see?

Yu Kuai April 25, 2022, 3:27 a.m. UTC | #4
On 2022/04/25 11:09, Bart Van Assche wrote:
> On 4/15/22 03:10, Yu Kuai wrote:
>> The single io performance(randwrite):
>>
>> | bs       | 128k | 256k | 512k | 1m   | 1280k | 2m   | 4m   |
>> | -------- | ---- | ---- | ---- | ---- | ----- | ---- | ---- |
>> | bw MiB/s | 20.1 | 33.4 | 51.8 | 67.1 | 74.7  | 82.9 | 82.9 |
> 
> Although the above data is interesting, it is not sufficient. The above 
> data comes from a setup with a single hard disk. There are many other 
> configurations that are relevant (hard disk array, high speed NVMe, QD=1 
> USB stick, ...) but for which no conclusions can be drawn from the above 
> data.
Hi,

The original idea is to improve large-bs randwrite performance on HDD;
here I just tested that specific case on an HDD. It's right that many
other test cases and configurations are relevant.
> 
> Another question is whether the approach of this patch series is the 
> right approach? I would expect that round-robin wakeup of waiters would 
> be ideal from a fairness point of view. However, there are patches in 
> this patch series that guarantee that wakeup of tag waiters won't happen 
> in a round robin fashion.

I was thinking that round-robin can't guarantee fairness in the corner
case that the 8 waitqueues are not balanced. For example, one waitqueue
somehow has lots of waiters, and it's better to handle them before newly
arrived waiters in other waitqueues.

What do you think about this approach: keep round-robin wakeup if the
waitqueues are balanced, otherwise choose the waitqueue with the most
waiters? A hypothetical sketch follows below.
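/*
 * Hypothetical sketch of the policy described above, not code from the
 * series: 'waiters_cnt' is the per-waitqueue counter patch 1 introduces,
 * while the function name and the 2x imbalance threshold are invented.
 * Keep round-robin while waiter counts are roughly balanced, otherwise
 * serve the waitqueue with the most waiters first.
 */
static struct sbq_wait_state *sbq_wake_ptr_balanced(struct sbitmap_queue *sbq)
{
	int i, cur, max = 0, max_idx = -1, min = INT_MAX;

	for (i = 0; i < SBQ_WAIT_QUEUES; i++) {
		int cnt = atomic_read(&sbq->ws[i].waiters_cnt);

		if (cnt > max) {
			max = cnt;
			max_idx = i;
		}
		if (cnt < min)
			min = cnt;
	}

	if (max_idx < 0)
		return NULL;			/* no waiters at all */

	if (max > 2 * (min + 1))		/* unbalanced: longest first */
		return &sbq->ws[max_idx];

	cur = atomic_read(&sbq->wake_index);	/* balanced: round-robin */
	atomic_set(&sbq->wake_index, sbq_index_inc(cur));
	return &sbq->ws[cur];
}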

Thanks,
Kuai
Yu Kuai April 25, 2022, 6:14 a.m. UTC | #5
On 2022/04/25 11:24, Damien Le Moal wrote:
> On 4/24/22 11:43, yukuai (C) wrote:
>> [...]
> 
> Device write cache disabled?
> 
> Also, what is the max QD of this disk?
> 
> E.g., if it is SATA, it is 32, so you will only get at most 64 scheduler
> tags. So for any of your tests with more than 64 threads, many of the
> threads will be waiting for a scheduler tag for the BIO before the
> bio_split problem you explain triggers. Given that the numbers you show
> are the same before and after the patch for a number of threads <= 64, I
> am tempted to think that the problem is not really BIO splitting...
> 
> What about random read workloads? What kind of results do you see?

Hi,

Sorry about the misleading test case.

This test case is high-concurrency huge randwrite; it targets just the
problem that split bios won't be issued continuously, which is the root
cause of the performance degradation as numjobs increases.

queue_depth is 32, so there are 64 scheduler tags; thus when numjobs is
not greater than 8, performance is fine, because each job's 8 split
requests (8 jobs x 8 tags = 64) can sit in the queue together and the
ratio of sequential io should be 7/8. However, as numjobs increases,
performance is worse because the ratio is lower. For example, when
numjobs is 512, the ratio of sequential io is about 20%.

patch 6-8 will let split bios still be issued continuously under high
pressure.

Thanks,
Kuai
Damien Le Moal April 25, 2022, 6:23 a.m. UTC | #6
On 4/25/22 15:14, yukuai (C) wrote:
> On 2022/04/25 11:24, Damien Le Moal wrote:
>> [...]
> 
> Hi,
> 
> Sorry about the misleading test case.
> 
> This test case is high-concurrency huge randwrite; it targets just the
> problem that split bios won't be issued continuously, which is the root
> cause of the performance degradation as numjobs increases.
> 
> queue_depth is 32, so there are 64 scheduler tags; thus when numjobs is
> not greater than 8, performance is fine, because each job's 8 split
> requests can sit in the queue together and the ratio of sequential io
> should be 7/8. However, as numjobs increases, performance is worse
> because the ratio is lower. For example, when numjobs is 512, the ratio
> of sequential io is about 20%.

But with 512 jobs, only 64 of them will have IOs in the queue. All other
jobs will be waiting for a scheduler tag before being able to issue
their large BIO. No?

It sounds like the set of scheduler tags should be a bit more elastic:
always allow BIOs from a split of a large BIO to be submitted (that is,
to get a scheduler tag) even if that causes a temporary excess of the
number of requests beyond the default number of scheduler tags. Doing
so, all fragments of a large BIO can be queued immediately. From there,
if the scheduler operates correctly, all the requests from the large
BIO's split would be issued in sequence to the device. A hypothetical
sketch of this idea follows below.
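A minimal self-contained model of that "temporary excess", purely
hypothetical and not from any patch (NR_SCHED_TAGS, in_flight and the
first_fragment flag are all invented for illustration): the first
fragment of a split obeys the normal cap, while continuation fragments
of a split that already holds its first tag may exceed it, so the whole
split can queue at once.

/* Hypothetical userspace model of elastic scheduler tags. */
#include <stdatomic.h>
#include <stdbool.h>

#define NR_SCHED_TAGS 64

static atomic_int in_flight;

static bool get_sched_tag(bool first_fragment)
{
	int cur = atomic_load(&in_flight);

	for (;;) {
		if (first_fragment && cur >= NR_SCHED_TAGS)
			return false;	/* over the cap: caller must wait */
		/* continuation fragments may exceed the cap temporarily */
		if (atomic_compare_exchange_weak(&in_flight, &cur, cur + 1))
			return true;
	}
}

static void put_sched_tag(void)
{
	atomic_fetch_sub(&in_flight, 1);	/* completion returns the tag */
}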


> 
> patch 6-8 will let split bios still be issued continuously under high
> pressure.
> 
> Thanks,
> Kuai
>
Yu Kuai April 25, 2022, 6:47 a.m. UTC | #7
On 2022/04/25 14:23, Damien Le Moal wrote:
> On 4/25/22 15:14, yukuai (C) wrote:
>> [...]
> 
> But with 512 jobs, only 64 of them will have IOs in the queue. All
> other jobs will be waiting for a scheduler tag before being able to
> issue their large BIO. No?

Hi,

It's right.

In fact, after this patchset, since each large io needs 8 tags in total,
only 8 jobs can be in the queue while the others are waiting for a
scheduler tag.

> 
> It sounds like the set of scheduler tags should be a bit more elastic:
> always allow BIOs from a split of a large BIO to be submitted (that is,
> to get a scheduler tag) even if that causes a temporary excess of the
> number of requests beyond the default number of scheduler tags. Doing
> so, all fragments of a large BIO can be queued immediately. From there,
> if the scheduler operates correctly, all the requests from the large
> BIO's split would be issued in sequence to the device.

This solution sounds feasible in theory; however, I'm not sure yet how
to implement that 'temporary excess'.

Thanks,
Kuai
Damien Le Moal April 25, 2022, 6:50 a.m. UTC | #8
On 4/25/22 15:47, yukuai (C) wrote:
> On 2022/04/25 14:23, Damien Le Moal wrote:
>> [...]
> 
> This solution sounds feasible in theory; however, I'm not sure yet how
> to implement that 'temporary excess'.

It should not be too hard.

By the way, did you check that doing something like:

echo 2048 > /sys/block/sdX/queue/nr_requests

improves performance for your test case with a high number of jobs?

Yu Kuai April 25, 2022, 7:05 a.m. UTC | #9
On 2022/04/25 14:50, Damien Le Moal wrote:
> On 4/25/22 15:47, yukuai (C) wrote:
>> [...]
>>
>> This solution sounds feasible in theory; however, I'm not sure yet how
>> to implement that 'temporary excess'.
> 
> It should not be too hard.

I'll try to figure out a proper way; in the meantime, any suggestions
would be appreciated.
> 
> By the way, did you check that doing something like:
> 
> echo 2048 > /sys/block/sdX/queue/nr_requests
> 
> improves performance for your test case with a high number of jobs?

Yes, performance will not degrade when numjobs is not greater than 256
in this case.
Damien Le Moal April 25, 2022, 7:06 a.m. UTC | #10
On 4/25/22 16:05, yukuai (C) wrote:
> 在 2022/04/25 14:50, Damien Le Moal 写道:
>> On 4/25/22 15:47, yukuai (C) wrote:
>>> 在 2022/04/25 14:23, Damien Le Moal 写道:
>>>> On 4/25/22 15:14, yukuai (C) wrote:
>>>>> 在 2022/04/25 11:24, Damien Le Moal 写道:
>>>>>> On 4/24/22 11:43, yukuai (C) wrote:
>>>>>>> friendly ping ...
>>>>>>>
>>>>>>> 在 2022/04/15 18:10, Yu Kuai 写道:
>>>>>>>> Changes in v3:
>>>>>>>>      - update 'waiters_cnt' before 'ws_active' in sbitmap_prepare_to_wait()
>>>>>>>>      in patch 1, in case __sbq_wake_up() see 'ws_active > 0' while
>>>>>>>>      'waiters_cnt' are all 0, which will cause deap loop.
>>>>>>>>      - don't add 'wait_index' during each loop in patch 2
>>>>>>>>      - fix that 'wake_index' might mismatch in the first wake up in patch 3,
>>>>>>>>      also improving coding for the patch.
>>>>>>>>      - add a detection in patch 4 in case io hung is triggered in corner
>>>>>>>>      cases.
>>>>>>>>      - make the detection, free tags are sufficient, more flexible.
>>>>>>>>      - fix a race in patch 8.
>>>>>>>>      - fix some words and add some comments.
>>>>>>>>
>>>>>>>> Changes in v2:
>>>>>>>>      - use a new title
>>>>>>>>      - add patches to fix waitqueues' unfairness - path 1-3
>>>>>>>>      - delete patch to add queue flag
>>>>>>>>      - delete patch to split big io thoroughly
>>>>>>>>
>>>>>>>> In this patchset:
>>>>>>>>      - patch 1-3 fix waitqueues' unfairness.
>>>>>>>>      - patch 4,5 disable tag preemption on heavy load.
>>>>>>>>      - patch 6 forces tag preemption for split bios.
>>>>>>>>      - patch 7,8 improve large random io for HDD. We do meet the problem and
>>>>>>>>      I'm trying to fix it at very low cost. However, if anyone still thinks
>>>>>>>>      this is not a common case and not worth to optimize, I'll drop them.
>>>>>>>>
>>>>>>>> There is a defect for blk-mq compare to blk-sq, specifically split io
>>>>>>>> will end up discontinuous if the device is under high io pressure, while
>>>>>>>> split io will still be continuous in sq, this is because:
>>>>>>>>
>>>>>>>> 1) new io can preempt tag even if there are lots of threads waiting.
>>>>>>>> 2) split bio is issued one by one, if one bio can't get tag, it will go
>>>>>>>> to wail.
>>>>>>>> 3) each time 8(or wake batch) requests is done, 8 waiters will be woken up.
>>>>>>>> Thus if a thread is woken up, it will unlikey to get multiple tags.
>>>>>>>>
>>>>>>>> The problem was first found by upgrading kernel from v3.10 to v4.18,
>>>>>>>> test device is HDD with 256 'max_sectors_kb', and test case is issuing 1m
>>>>>>>> ios with high concurrency.
>>>>>>>>
>>>>>>>> Noted that there is a precondition for such performance problem:
>>>>>>>> There is a certain gap between bandwidth for single io with
>>>>>>>> bs=max_sectors_kb and disk upper limit.
>>>>>>>>
>>>>>>>> During the test, I found that waitqueues can be extremely unbalanced on
>>>>>>>> heavy load. This is because 'wake_index' is not set properly in
>>>>>>>> __sbq_wake_up(), see details in patch 3.
>>>>>>>>
>>>>>>>> Test environment:
>>>>>>>> arm64, 96 core with 200 BogoMIPS, test device is HDD. The default
>>>>>>>> 'max_sectors_kb' is 1280(Sorry that I was unable to test on the machine
>>>>>>>> where 'max_sectors_kb' is 256).
>>>>>>>>
>>>>>>>> The single io performance(randwrite):
>>>>>>>>
>>>>>>>> | bs       | 128k | 256k | 512k | 1m   | 1280k | 2m   | 4m   |
>>>>>>>> | -------- | ---- | ---- | ---- | ---- | ----- | ---- | ---- |
>>>>>>>> | bw MiB/s | 20.1 | 33.4 | 51.8 | 67.1 | 74.7  | 82.9 | 82.9 |
>>>>>>
>>>>>> These results are extremely strange, unless you are running with the
>>>>>> device write cache disabled ? If you have the device write cache enabled,
>>>>>> the problem you mention above would be most likely completely invisible,
>>>>>> which I guess is why nobody really noticed any issue until now.
>>>>>>
>>>>>> Similarly, with reads, the device side read-ahead may hide the problem,
>>>>>> albeit that depends on how "intelligent" the drive is at identifying
>>>>>> sequential accesses.
>>>>>>
>>>>>>>>
>>>>>>>> It can be seen that 1280k io is already close to upper limit, and it'll
>>>>>>>> be hard to see differences with the default value, thus I set
>>>>>>>> 'max_sectors_kb' to 128 in the following test.
>>>>>>>>
>>>>>>>> Test cmd:
>>>>>>>>             fio \
>>>>>>>>             -filename=/dev/$dev \
>>>>>>>>             -name=test \
>>>>>>>>             -ioengine=psync \
>>>>>>>>             -allow_mounted_write=0 \
>>>>>>>>             -group_reporting \
>>>>>>>>             -direct=1 \
>>>>>>>>             -offset_increment=1g \
>>>>>>>>             -rw=randwrite \
>>>>>>>>             -bs=1024k \
>>>>>>>>             -numjobs={1,2,4,8,16,32,64,128,256,512} \
>>>>>>>>             -runtime=110 \
>>>>>>>>             -ramp_time=10
>>>>>>>>
>>>>>>>> Test result: MiB/s
>>>>>>>>
>>>>>>>> | numjobs | v5.18-rc1 | v5.18-rc1-patched |
>>>>>>>> | ------- | --------- | ----------------- |
>>>>>>>> | 1       | 67.7      | 67.7              |
>>>>>>>> | 2       | 67.7      | 67.7              |
>>>>>>>> | 4       | 67.7      | 67.7              |
>>>>>>>> | 8       | 67.7      | 67.7              |
>>>>>>>> | 16      | 64.8      | 65.6              |
>>>>>>>> | 32      | 59.8      | 63.8              |
>>>>>>>> | 64      | 54.9      | 59.4              |
>>>>>>>> | 128     | 49        | 56.9              |
>>>>>>>> | 256     | 37.7      | 58.3              |
>>>>>>>> | 512     | 31.8      | 57.9              |
>>>>>>
>>>>>> Device write cache disabled ?
>>>>>>
>>>>>> Also, what is the max QD of this disk ?
>>>>>>
>>>>>> E.g., if it is SATA, it is 32, so you will only get at most 64 scheduler
>>>>>> tags. So for any of your tests with more than 64 threads, many of the
>>>>>> threads will be waiting for a scheduler tag for the BIO before the
>>>>>> bio_split problem you explain triggers. Given that the numbers you show
>>>>>> are the same for before-after patch with a number of threads <= 64, I am
>>>>>> tempted to think that the problem is not really BIO splitting...
>>>>>>
>>>>>> What about random read workloads ? What kind of results do you see ?
>>>>>
>>>>> Hi,
>>>>>
>>>>> Sorry about the misleading test case.
>>>>>
>>>>> This test case is high-concurrency large randwrite; it's just for the
>>>>> problem that split bios won't be issued continuously, which is the
>>>>> root cause of the performance degradation as numjobs increases.
>>>>>
>>>>> queue_depth is 32 and nr_requests is 64, thus when numjobs is not
>>>>> greater than 8, performance is fine, because the ratio of sequential
>>>>> io should be 7/8. However, as numjobs increases, performance gets
>>>>> worse because the ratio is lower. For example, when numjobs is 512,
>>>>> the ratio of sequential io is about 20%.
>>>>
>>>> But with 512 jobs, only 64 of them will have IOs in the queue.
>>>> All other jobs will be waiting for a scheduler tag before being able to
>>>> issue their large BIO. No ?
>>>
>>> Hi,
>>>
>>> It's right.
>>>
>>> In fact, after this patchset, since each large io needs 8 tags in
>>> total, only 8 jobs can be in the queue while the others are waiting
>>> for a scheduler tag.
>>>
>>>>
>>>> It sounds like the set of scheduler tags should be a bit more elastic:
>>>> always allow BIOs from a split of a large BIO to be submitted (that is to
>>>> get a scheduler tag) even if that causes a temporary excess of the number
>>>> of requests beyond the default number of scheduler tags. Doing so, all
>>>> fragments of a large BIOs can be queued immediately. From there, if the
>>>> scheduler operates correctly, all the requests from the large BIOs split
>>>> would be issued in sequence to the device.
>>>
>>> This solution sounds feasible in theory; however, I'm not sure yet how
>>> to implement that 'temporary excess'.
>>
>> It should not be too hard.
> 
> I'll try to figure out a proper way; in the meantime, any suggestions
> would be appreciated.
>>
>> By the way, did you check that doing something like:
>>
>> echo 2048 > /sys/block/sdX/queue/nr_requests
>>
>> improves performance for your high number of jobs test case ?
> 
> Yes, performance will not degrade when numjobs is not greater than 256
> in this case.

That is my thinking as well. I am asking if you did check that (did you run it ?).

>>
>>>
>>> Thanks,
>>> Kuai
>>>>
>>>>
>>>>>
>>>>> patch 6-8 will let split bios still be issued continuously under high
>>>>> pressure.
>>>>>
>>>>> Thanks,
>>>>> Kuai
>>>>>
>>>>
>>>>
>>
>>
Yu Kuai April 25, 2022, 7:28 a.m. UTC | #11
On 2022/04/25 15:06, Damien Le Moal wrote:

>>> By the way, did you check that doing something like:
>>>
>>> echo 2048 > /sys/block/sdX/queue/nr_requests
>>>
>>> improves performance for your high number of jobs test case ?
>>
>> Yes, performance will not degrade when numjobs is not greater than 256
>> in this case.
> 
> That is my thinking as well. I am asking if you did check that (did you run it ?).

Hi,

I'm sure I ran it with 256 jobs before.

However, I didn't run it with 512 jobs. And following is the result I
just tested:

ratio of sequential io: 49.1%

Read|Write seek 

cnt 99338, zero cnt 48753 

     >=(KB) .. <(KB)     : count       ratio |distribution
          0 .. 1         : 48753       49.1% |########################################|
          1 .. 2         : 0            0.0% |
          2 .. 4         : 0            0.0% |
          4 .. 8         : 0            0.0% |
          8 .. 16        : 0            0.0% |
         16 .. 32        : 0            0.0% |
         32 .. 64        : 0            0.0% |
         64 .. 128       : 4975         5.0% |#####
        128 .. 256       : 4439         4.5% |####
        256 .. 512       : 2615         2.6% |###
        512 .. 1024      : 967          1.0% |#
       1024 .. 2048      : 213          0.2% |#
       2048 .. 4096      : 375          0.4% |#
       4096 .. 8192      : 723          0.7% |#
       8192 .. 16384     : 1436         1.4% |##
      16384 .. 32768     : 2626         2.6% |###
      32768 .. 65536     : 4197         4.2% |####
      65536 .. 131072    : 6431         6.5% |######
     131072 .. 262144    : 7590         7.6% |#######
     262144 .. 524288    : 6433         6.5% |######
     524288 .. 1048576   : 4583         4.6% |####
    1048576 .. 2097152   : 2237         2.3% |##
    2097152 .. 4194304   : 489          0.5% |#
    4194304 .. 8388608   : 83           0.1% |#
    8388608 .. 16777216  : 36           0.0% |#
   16777216 .. 33554432  : 0            0.0% |
   33554432 .. 67108864  : 0            0.0% |
   67108864 .. 134217728 : 137          0.1% |#
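
As a quick sanity check, the 49.1% ratio above is simply the zero-seek
bucket divided by the total seek count:

#include <stdio.h>

int main(void)
{
	int cnt = 99338, zero_cnt = 48753;	/* "cnt" and "zero cnt" above */

	printf("sequential ratio: %.1f%%\n", 100.0 * zero_cnt / cnt);
	return 0;	/* prints 49.1% */
}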
Damien Le Moal April 25, 2022, 11:20 a.m. UTC | #12
On 4/25/22 16:28, yukuai (C) wrote:
> On 2022/04/25 15:06, Damien Le Moal wrote:
> 
>>>> By the way, did you check that doing something like:
>>>>
>>>> echo 2048 > /sys/block/sdX/queue/nr_requests
>>>>
>>>> improves performance for your high number of jobs test case ?
>>>
>>> Yes, performance will not degrade when numjobs is not greater than 256
>>> in this case.
>>
>> That is my thinking as well. I am asking if you did check that (did you run it ?).
> 
> Hi,
> 
> I'm sure I ran it with 256 jobs before.
> 
> However, I didn't run it with 512 jobs. And following is the result I
> just tested:

What was nr_requests ? The default 64 ?
If you increase that number, do you see better throughput/more requests
being sequential ?


> 
> ratio of sequential io: 49.1%
> 
> Read|Write seek 
> 
> cnt 99338, zero cnt 48753 
> 
>      >=(KB) .. <(KB)     : count       ratio |distribution
>           0 .. 1         : 48753       49.1% |########################################|
>           1 .. 2         : 0            0.0% |
>           2 .. 4         : 0            0.0% |
>           4 .. 8         : 0            0.0% |
>           8 .. 16        : 0            0.0% |
>          16 .. 32        : 0            0.0% |
>          32 .. 64        : 0            0.0% |
>          64 .. 128       : 4975         5.0% |#####
>         128 .. 256       : 4439         4.5% |####
>         256 .. 512       : 2615         2.6% |###
>         512 .. 1024      : 967          1.0% |#
>        1024 .. 2048      : 213          0.2% |#
>        2048 .. 4096      : 375          0.4% |#
>        4096 .. 8192      : 723          0.7% |#
>        8192 .. 16384     : 1436         1.4% |##
>       16384 .. 32768     : 2626         2.6% |###
>       32768 .. 65536     : 4197         4.2% |####
>       65536 .. 131072    : 6431         6.5% |######
>      131072 .. 262144    : 7590         7.6% |#######
>      262144 .. 524288    : 6433         6.5% |######
>      524288 .. 1048576   : 4583         4.6% |####
>     1048576 .. 2097152   : 2237         2.3% |##
>     2097152 .. 4194304   : 489          0.5% |#
>     4194304 .. 8388608   : 83           0.1% |#
>     8388608 .. 16777216  : 36           0.0% |#
>    16777216 .. 33554432  : 0            0.0% |
>    33554432 .. 67108864  : 0            0.0% |
>    67108864 .. 134217728 : 137          0.1% |#
Yu Kuai April 25, 2022, 1:42 p.m. UTC | #13
On 2022/04/25 19:20, Damien Le Moal wrote:
> On 4/25/22 16:28, yukuai (C) wrote:
>> On 2022/04/25 15:06, Damien Le Moal wrote:
>>
>>>>> By the way, did you check that doing something like:
>>>>>
>>>>> echo 2048 > /sys/block/sdX/queue/nr_requests
>>>>>
>>>>> improves performance for your high number of jobs test case ?
>>>>
>>>> Yes, performance will not degrade when numjobs is not greater than 256
>>>> in this case.
>>>
>>> That is my thinking as well. I am asking if you did check that (did you run it ?).
>>
>> Hi,
>>
>> I'm sure I ran it with 256 jobs before.
>>
>> However, I didn't run it with 512 jobs. And following is the result I
>> just tested:
> 
> What was nr_requests ? The default 64 ?
> If you increase that number, do you see better throughput/more requests
> being sequential ?

Sorry if I didn't explain this clearly.

If nr_requests is 64 and numjobs is 512, the ratio of sequential io is
about 20%. If nr_requests is 2048 and numjobs is 512, the ratio is 49.1%.

So yes, increasing nr_requests can improve performance in this test case.
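
The two cutoffs quoted in this thread (8 jobs at the default depth, 256
jobs at 2048) also fall out of simple arithmetic, since each 1m bio splits
into bs / max_sectors_kb fragments; a small back-of-envelope sketch:

#include <stdio.h>

int main(void)
{
	int bs_kb = 1024;		/* fio bs=1024k */
	int max_sectors_kb = 128;	/* sysfs max_sectors_kb */
	int splits = bs_kb / max_sectors_kb;	/* 8 fragments per large bio */
	int depths[] = { 64, 2048 };	/* default vs raised nr_requests */
	int i;

	for (i = 0; i < 2; i++)
		printf("nr_requests=%-4d -> %3d jobs can queue whole split batches\n",
		       depths[i], depths[i] / splits);
	return 0;	/* prints 8 and 256 */
}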

> 
> 
>>
>> ratio of sequential io: 49.1%
>>
>> Read|Write seek
>>
>> cnt 99338, zero cnt 48753
>>
>>      >=(KB) .. <(KB)     : count       ratio |distribution
>>           0 .. 1         : 48753       49.1% |########################################|
>>           1 .. 2         : 0            0.0% |
>>           2 .. 4         : 0            0.0% |
>>           4 .. 8         : 0            0.0% |
>>           8 .. 16        : 0            0.0% |
>>          16 .. 32        : 0            0.0% |
>>          32 .. 64        : 0            0.0% |
>>          64 .. 128       : 4975         5.0% |#####
>>         128 .. 256       : 4439         4.5% |####
>>         256 .. 512       : 2615         2.6% |###
>>         512 .. 1024      : 967          1.0% |#
>>        1024 .. 2048      : 213          0.2% |#
>>        2048 .. 4096      : 375          0.4% |#
>>        4096 .. 8192      : 723          0.7% |#
>>        8192 .. 16384     : 1436         1.4% |##
>>       16384 .. 32768     : 2626         2.6% |###
>>       32768 .. 65536     : 4197         4.2% |####
>>       65536 .. 131072    : 6431         6.5% |######
>>      131072 .. 262144    : 7590         7.6% |#######
>>      262144 .. 524288    : 6433         6.5% |######
>>      524288 .. 1048576   : 4583         4.6% |####
>>     1048576 .. 2097152   : 2237         2.3% |##
>>     2097152 .. 4194304   : 489          0.5% |#
>>     4194304 .. 8388608   : 83           0.1% |#
>>     8388608 .. 16777216  : 36           0.0% |#
>>    16777216 .. 33554432  : 0            0.0% |
>>    33554432 .. 67108864  : 0            0.0% |
>>    67108864 .. 134217728 : 137          0.1% |#
> 
>