
bio: limit bio max size.

Message ID 20210112083311.30013-1-nanich.lee@samsung.com (mailing list archive)
State New, archived
Series bio: limit bio max size.

Commit Message

Changheun Lee Jan. 12, 2021, 8:33 a.m. UTC
From: "Changheun Lee" <nanich.lee@samsung.com>

bio size can grow up to 4GB when multi-page bvec is enabled, but this
sometimes leads to inefficient behavior. In the case of large-chunk
direct I/O - e.g. a 64MB chunk read in user space - all pages for the
64MB are merged into a single bio structure if the memory addresses are
physically contiguous. This delays the first submit until the merge is
complete. The bio max size should therefore be limited to a proper size.

Signed-off-by: Changheun Lee <nanich.lee@samsung.com>
---
 block/bio.c         | 2 +-
 include/linux/bio.h | 3 ++-
 2 files changed, 3 insertions(+), 2 deletions(-)
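
For reference: with the usual BIO_MAX_PAGES of 256, the new BIO_MAX_SIZE cap
works out to 256 * PAGE_SIZE - 1MB with 4KB pages, 16MB with 64KB pages.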

Comments

Damien Le Moal Jan. 12, 2021, 9:16 a.m. UTC | #1
On 2021/01/12 17:52, Changheun Lee wrote:
> From: "Changheun Lee" <nanich.lee@samsung.com>
> 
> bio size can grow up to 4GB when muli-page bvec is enabled.
> but sometimes it would lead to inefficient behaviors.
> in case of large chunk direct I/O, - 64MB chunk read in user space -
> all pages for 64MB would be merged to a bio structure if memory address is
> continued phsycally. it makes some delay to submit until merge complete.
> bio max size should be limited as a proper size.

But merging physically contiguous pages into the same bvec + later automatic bio
split on submit should give you better throughput for large IOs compared to
having to issue a bio chain of smaller BIOs that are arbitrarily sized and will
likely need splitting anyway (because of DMA boundaries etc).

Do you have a specific case where you see higher performance with this patch
applied? On Intel, BIO_MAX_SIZE would be 1MB... That is arbitrary and too small
considering that a lot of hardware can execute larger IOs than that.


> 
> Signed-off-by: Changheun Lee <nanich.lee@samsung.com>
> ---
>  block/bio.c         | 2 +-
>  include/linux/bio.h | 3 ++-
>  2 files changed, 3 insertions(+), 2 deletions(-)
> 
> diff --git a/block/bio.c b/block/bio.c
> index 1f2cc1fbe283..dbe14d675f28 100644
> --- a/block/bio.c
> +++ b/block/bio.c
> @@ -877,7 +877,7 @@ bool __bio_try_merge_page(struct bio *bio, struct page *page,
>  		struct bio_vec *bv = &bio->bi_io_vec[bio->bi_vcnt - 1];
>  
>  		if (page_is_mergeable(bv, page, len, off, same_page)) {
> -			if (bio->bi_iter.bi_size > UINT_MAX - len) {
> +			if (bio->bi_iter.bi_size > BIO_MAX_SIZE - len) {
>  				*same_page = false;
>  				return false;
>  			}
> diff --git a/include/linux/bio.h b/include/linux/bio.h
> index 1edda614f7ce..0f49b354b1f6 100644
> --- a/include/linux/bio.h
> +++ b/include/linux/bio.h
> @@ -20,6 +20,7 @@
>  #endif
>  
>  #define BIO_MAX_PAGES		256
> +#define BIO_MAX_SIZE		(BIO_MAX_PAGES * PAGE_SIZE)
>  
>  #define bio_prio(bio)			(bio)->bi_ioprio
>  #define bio_set_prio(bio, prio)		((bio)->bi_ioprio = prio)
> @@ -113,7 +114,7 @@ static inline bool bio_full(struct bio *bio, unsigned len)
>  	if (bio->bi_vcnt >= bio->bi_max_vecs)
>  		return true;
>  
> -	if (bio->bi_iter.bi_size > UINT_MAX - len)
> +	if (bio->bi_iter.bi_size > BIO_MAX_SIZE - len)
>  		return true;
>  
>  	return false;
>
Changheun Lee Jan. 12, 2021, 11:58 a.m. UTC | #2
>On 2021/01/12 17:52, Changheun Lee wrote:
>> From: "Changheun Lee" <nanich.lee@samsung.com>
>> 
>> bio size can grow up to 4GB when muli-page bvec is enabled.
>> but sometimes it would lead to inefficient behaviors.
>> in case of large chunk direct I/O, - 64MB chunk read in user space -
>> all pages for 64MB would be merged to a bio structure if memory address is
>> continued phsycally. it makes some delay to submit until merge complete.
>> bio max size should be limited as a proper size.
>
>But merging physically contiguous pages into the same bvec + later automatic bio
>split on submit should give you better throughput for large IOs compared to
>having to issue a bio chain of smaller BIOs that are arbitrarily sized and will
>likely need splitting anyway (because of DMA boundaries etc).
>
>Do you have a specific case where you see higher performance with this patch
>applied ? On Intel, BIO_MAX_SIZE would be 1MB... That is arbitrary and too small
>considering that many hardware can execute larger IOs than that.
>

When I tested a 32MB chunk read with O_DIRECT on Android, all pages of the
32MB were merged into a single bio structure, and the elapsed time until the
merge completed was about 2ms.
That means the first bio submit only happens after 2ms.
If the bio size is limited to 1MB with this patch, the first bio submit
happens after about 100us thanks to the bio_full check.
It is not a large delay and cannot be observed with a low-speed device,
but reducing the merge delay matters for a high-speed device.
I improved 512MB sequential read performance from 1900MB/s to 2000MB/s
with this patch on an Android platform.
As you said, 1MB might be small for some devices,
so a method is needed to resize or select the bio max size.

>
>> 
>> Signed-off-by: Changheun Lee <nanich.lee@samsung.com>
>> ---
>>  block/bio.c         | 2 +-
>>  include/linux/bio.h | 3 ++-
>>  2 files changed, 3 insertions(+), 2 deletions(-)
>> 
>> diff --git a/block/bio.c b/block/bio.c
>> index 1f2cc1fbe283..dbe14d675f28 100644
>> --- a/block/bio.c
>> +++ b/block/bio.c
>> @@ -877,7 +877,7 @@ bool __bio_try_merge_page(struct bio *bio, struct page *page,
>>  		struct bio_vec *bv = &bio->bi_io_vec[bio->bi_vcnt - 1];
>>  
>>  		if (page_is_mergeable(bv, page, len, off, same_page)) {
>> -			if (bio->bi_iter.bi_size > UINT_MAX - len) {
>> +			if (bio->bi_iter.bi_size > BIO_MAX_SIZE - len) {
>>  				*same_page = false;
>>  				return false;
>>  			}
>> diff --git a/include/linux/bio.h b/include/linux/bio.h
>> index 1edda614f7ce..0f49b354b1f6 100644
>> --- a/include/linux/bio.h
>> +++ b/include/linux/bio.h
>> @@ -20,6 +20,7 @@
>>  #endif
>>  
>>  #define BIO_MAX_PAGES		256
>> +#define BIO_MAX_SIZE		(BIO_MAX_PAGES * PAGE_SIZE)
>>  
>>  #define bio_prio(bio)			(bio)->bi_ioprio
>>  #define bio_set_prio(bio, prio)		((bio)->bi_ioprio = prio)
>> @@ -113,7 +114,7 @@ static inline bool bio_full(struct bio *bio, unsigned len)
>>  	if (bio->bi_vcnt >= bio->bi_max_vecs)
>>  		return true;
>>  
>> -	if (bio->bi_iter.bi_size > UINT_MAX - len)
>> +	if (bio->bi_iter.bi_size > BIO_MAX_SIZE - len)
>>  		return true;
>>  
>>  	return false;
>> 
>
>
>-- 
>Damien Le Moal
>Western Digital Research
Damien Le Moal Jan. 13, 2021, 1:16 a.m. UTC | #3
On 2021/01/12 21:14, Changheun Lee wrote:
>> On 2021/01/12 17:52, Changheun Lee wrote:
>>> From: "Changheun Lee" <nanich.lee@samsung.com>
>>>
>>> bio size can grow up to 4GB when muli-page bvec is enabled.
>>> but sometimes it would lead to inefficient behaviors.
>>> in case of large chunk direct I/O, - 64MB chunk read in user space -
>>> all pages for 64MB would be merged to a bio structure if memory address is
>>> continued phsycally. it makes some delay to submit until merge complete.
>>> bio max size should be limited as a proper size.
>>
>> But merging physically contiguous pages into the same bvec + later automatic bio
>> split on submit should give you better throughput for large IOs compared to
>> having to issue a bio chain of smaller BIOs that are arbitrarily sized and will
>> likely need splitting anyway (because of DMA boundaries etc).
>>
>> Do you have a specific case where you see higher performance with this patch
>> applied ? On Intel, BIO_MAX_SIZE would be 1MB... That is arbitrary and too small
>> considering that many hardware can execute larger IOs than that.
>>
> 
> When I tested 32MB chunk read with O_DIRECT in android, all pages of 32MB
> is merged into a bio structure.
> And elapsed time to merge complete was about 2ms.
> It means first bio-submit is after 2ms.
> If bio size is limited with 1MB with this patch, first bio-submit is about
> 100us by bio_full operation.

bio_submit() will split the large BIO case into multiple requests while the
small BIO case will likely result in one or two requests only. That likely
explains the time difference here. However, for the large case, the 2ms will
issue ALL requests needed for processing the entire 32MB user IO while the
1MB bio case will need 32 different bio_submit() calls. So what is the actual
total latency difference for the entire 32MB user IO? That is, I think, what
needs to be compared here.

Also, what are your device's max_sectors_kb and max queue depth?
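(Typically both are visible through sysfs, e.g. /sys/block/<dev>/queue/max_sectors_kb
and, for SCSI-based devices such as UFS, /sys/block/<dev>/device/queue_depth.)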

> It's not large delay and can't be observed with low speed device.
> But it's needed to reduce merge delay for high speed device.
> I improved 512MB sequential read performance from 1900MB/s to 2000MB/s
> with this patch on android platform.
> As you said, 1MB might be small for some device.
> But method is needed to re-size, or select the bio max size.

At the very least, I think that such a limit should not be arbitrary as your
patch proposes, but should rely on the device characteristics (e.g.
max_hw_sectors_kb/max_sectors_kb and queue depth).

> 
>>
>>>
>>> Signed-off-by: Changheun Lee <nanich.lee@samsung.com>
>>> ---
>>>  block/bio.c         | 2 +-
>>>  include/linux/bio.h | 3 ++-
>>>  2 files changed, 3 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/block/bio.c b/block/bio.c
>>> index 1f2cc1fbe283..dbe14d675f28 100644
>>> --- a/block/bio.c
>>> +++ b/block/bio.c
>>> @@ -877,7 +877,7 @@ bool __bio_try_merge_page(struct bio *bio, struct page *page,
>>>  		struct bio_vec *bv = &bio->bi_io_vec[bio->bi_vcnt - 1];
>>>  
>>>  		if (page_is_mergeable(bv, page, len, off, same_page)) {
>>> -			if (bio->bi_iter.bi_size > UINT_MAX - len) {
>>> +			if (bio->bi_iter.bi_size > BIO_MAX_SIZE - len) {
>>>  				*same_page = false;
>>>  				return false;
>>>  			}
>>> diff --git a/include/linux/bio.h b/include/linux/bio.h
>>> index 1edda614f7ce..0f49b354b1f6 100644
>>> --- a/include/linux/bio.h
>>> +++ b/include/linux/bio.h
>>> @@ -20,6 +20,7 @@
>>>  #endif
>>>  
>>>  #define BIO_MAX_PAGES		256
>>> +#define BIO_MAX_SIZE		(BIO_MAX_PAGES * PAGE_SIZE)
>>>  
>>>  #define bio_prio(bio)			(bio)->bi_ioprio
>>>  #define bio_set_prio(bio, prio)		((bio)->bi_ioprio = prio)
>>> @@ -113,7 +114,7 @@ static inline bool bio_full(struct bio *bio, unsigned len)
>>>  	if (bio->bi_vcnt >= bio->bi_max_vecs)
>>>  		return true;
>>>  
>>> -	if (bio->bi_iter.bi_size > UINT_MAX - len)
>>> +	if (bio->bi_iter.bi_size > BIO_MAX_SIZE - len)
>>>  		return true;
>>>  
>>>  	return false;
>>>
>>
>>
>> -- 
>> Damien Le Moal
>> Western Digital Research
>
Changheun Lee Jan. 13, 2021, 3:46 a.m. UTC | #4
>On 2021/01/12 21:14, Changheun Lee wrote:
>>> On 2021/01/12 17:52, Changheun Lee wrote:
>>>> From: "Changheun Lee" <nanich.lee@samsung.com>
>>>>
>>>> bio size can grow up to 4GB when muli-page bvec is enabled.
>>>> but sometimes it would lead to inefficient behaviors.
>>>> in case of large chunk direct I/O, - 64MB chunk read in user space -
>>>> all pages for 64MB would be merged to a bio structure if memory address is
>>>> continued phsycally. it makes some delay to submit until merge complete.
>>>> bio max size should be limited as a proper size.
>>>
>>> But merging physically contiguous pages into the same bvec + later automatic bio
>>> split on submit should give you better throughput for large IOs compared to
>>> having to issue a bio chain of smaller BIOs that are arbitrarily sized and will
>>> likely need splitting anyway (because of DMA boundaries etc).
>>>
>>> Do you have a specific case where you see higher performance with this patch
>>> applied ? On Intel, BIO_MAX_SIZE would be 1MB... That is arbitrary and too small
>>> considering that many hardware can execute larger IOs than that.
>>>
>> 
>> When I tested 32MB chunk read with O_DIRECT in android, all pages of 32MB
>> is merged into a bio structure.
>> And elapsed time to merge complete was about 2ms.
>> It means first bio-submit is after 2ms.
>> If bio size is limited with 1MB with this patch, first bio-submit is about
>> 100us by bio_full operation.
>
>bio_submit() will split the large BIO case into multiple requests while the
>small BIO case will likely result one or two requests only. That likely explain
>the time difference here. However, for the large case, the 2ms will issue ALL
>requests needed for processing the entire 32MB user IO while the 1MB bio case
>will need 32 different bio_submit() calls. So what is the actual total latency
>difference for the entire 32MB user IO ? That is I think what needs to be
>compared here.
>
>Also, what is your device max_sectors_kb and max queue depth ?
>

Without this patch, the total latency for the 32MB is about 19ms including the
merge time. With this patch, the total latency is about 17ms, also including
the merge time. The actual 32MB read time from the device is the same - about
16.7ms - at the driver layer. There is no need to hold more I/O than
max_sectors_kb during the bio merge.
My device is UFS; max_sectors_kb is 1MB and the queue depth is 32.

>> It's not large delay and can't be observed with low speed device.
>> But it's needed to reduce merge delay for high speed device.
>> I improved 512MB sequential read performance from 1900MB/s to 2000MB/s
>> with this patch on android platform.
>> As you said, 1MB might be small for some device.
>> But method is needed to re-size, or select the bio max size.
>
>At the very least, I think that such limit should not be arbitrary as your patch
>proposes but rely on the device characteristics (e.g.
>max_hw_sectors_kb/max_sectors_kb and queue depth).
>

I agree with your opinion; I had the same idea. But that needs deeper
research - the proper timing to set it, bio structure modification, etc.
The current patch is a simple one for a default bio max size.
Before multi-page bvec was applied, the bio max size was 1MB in kernel 4.x
because of BIO_MAX_PAGES, so I think a 1MB bio max size is reasonable as a
default.

>> 
>>>
>>>>
>>>> Signed-off-by: Changheun Lee <nanich.lee@samsung.com>
>>>> ---
>>>>  block/bio.c         | 2 +-
>>>>  include/linux/bio.h | 3 ++-
>>>>  2 files changed, 3 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/block/bio.c b/block/bio.c
>>>> index 1f2cc1fbe283..dbe14d675f28 100644
>>>> --- a/block/bio.c
>>>> +++ b/block/bio.c
>>>> @@ -877,7 +877,7 @@ bool __bio_try_merge_page(struct bio *bio, struct page *page,
>>>>  		struct bio_vec *bv = &bio->bi_io_vec[bio->bi_vcnt - 1];
>>>>  
>>>>  		if (page_is_mergeable(bv, page, len, off, same_page)) {
>>>> -			if (bio->bi_iter.bi_size > UINT_MAX - len) {
>>>> +			if (bio->bi_iter.bi_size > BIO_MAX_SIZE - len) {
>>>>  				*same_page = false;
>>>>  				return false;
>>>>  			}
>>>> diff --git a/include/linux/bio.h b/include/linux/bio.h
>>>> index 1edda614f7ce..0f49b354b1f6 100644
>>>> --- a/include/linux/bio.h
>>>> +++ b/include/linux/bio.h
>>>> @@ -20,6 +20,7 @@
>>>>  #endif
>>>>  
>>>>  #define BIO_MAX_PAGES		256
>>>> +#define BIO_MAX_SIZE		(BIO_MAX_PAGES * PAGE_SIZE)
>>>>  
>>>>  #define bio_prio(bio)			(bio)->bi_ioprio
>>>>  #define bio_set_prio(bio, prio)		((bio)->bi_ioprio = prio)
>>>> @@ -113,7 +114,7 @@ static inline bool bio_full(struct bio *bio, unsigned len)
>>>>  	if (bio->bi_vcnt >= bio->bi_max_vecs)
>>>>  		return true;
>>>>  
>>>> -	if (bio->bi_iter.bi_size > UINT_MAX - len)
>>>> +	if (bio->bi_iter.bi_size > BIO_MAX_SIZE - len)
>>>>  		return true;
>>>>  
>>>>  	return false;
>>>>
>>>
>>>
>>> -- 
>>> Damien Le Moal
>>> Western Digital Research
>> 
>
>
>-- 
>Damien Le Moal
>Western Digital Research
>

---
Changheun Lee
Samsung Electronics
Damien Le Moal Jan. 13, 2021, 5:53 a.m. UTC | #5
On 2021/01/13 13:01, Changheun Lee wrote:
>> On 2021/01/12 21:14, Changheun Lee wrote:
>>>> On 2021/01/12 17:52, Changheun Lee wrote:
>>>>> From: "Changheun Lee" <nanich.lee@samsung.com>
>>>>>
>>>>> bio size can grow up to 4GB when muli-page bvec is enabled.
>>>>> but sometimes it would lead to inefficient behaviors.
>>>>> in case of large chunk direct I/O, - 64MB chunk read in user space -
>>>>> all pages for 64MB would be merged to a bio structure if memory address is
>>>>> continued phsycally. it makes some delay to submit until merge complete.
>>>>> bio max size should be limited as a proper size.
>>>>
>>>> But merging physically contiguous pages into the same bvec + later automatic bio
>>>> split on submit should give you better throughput for large IOs compared to
>>>> having to issue a bio chain of smaller BIOs that are arbitrarily sized and will
>>>> likely need splitting anyway (because of DMA boundaries etc).
>>>>
>>>> Do you have a specific case where you see higher performance with this patch
>>>> applied ? On Intel, BIO_MAX_SIZE would be 1MB... That is arbitrary and too small
>>>> considering that many hardware can execute larger IOs than that.
>>>>
>>>
>>> When I tested 32MB chunk read with O_DIRECT in android, all pages of 32MB
>>> is merged into a bio structure.
>>> And elapsed time to merge complete was about 2ms.
>>> It means first bio-submit is after 2ms.
>>> If bio size is limited with 1MB with this patch, first bio-submit is about
>>> 100us by bio_full operation.
>>
>> bio_submit() will split the large BIO case into multiple requests while the
>> small BIO case will likely result one or two requests only. That likely explain
>> the time difference here. However, for the large case, the 2ms will issue ALL
>> requests needed for processing the entire 32MB user IO while the 1MB bio case
>> will need 32 different bio_submit() calls. So what is the actual total latency
>> difference for the entire 32MB user IO ? That is I think what needs to be
>> compared here.
>>
>> Also, what is your device max_sectors_kb and max queue depth ?
>>
> 
> 32MB total latency is about 19ms including merge time without this patch.
> But with this patch, total latency is about 17ms including merge time too.
> Actually 32MB read time from device is same - about 16.7ms - in driver layer.
> No need to hold more I/O than max_sectors_kb during bio merge.
> My device is UFS. and max_sectors_kb is 1MB, queue depth is 32.

This may be due to the CPU being slow: it takes time to build the 32MB BIO,
during which the device is idling. With the 1MB BIO limit, the first BIO hits
the drive much more quickly, reducing idle time of the device and leading to
higher throughput. But there are 31 more BIOs to build and issue after the first
one... That does create a pipeline of requests keeping the device busy, but that
also likely keeps your CPU a lot more busy building these additional BIOs.
Overall, do you see a difference in CPU load for the 32MB BIO case vs the 1MB
max BIO case ? Any increase in CPU load with the lower BIO size limit ?

> 
>>> It's not large delay and can't be observed with low speed device.
>>> But it's needed to reduce merge delay for high speed device.
>>> I improved 512MB sequential read performance from 1900MB/s to 2000MB/s
>>> with this patch on android platform.
>>> As you said, 1MB might be small for some device.
>>> But method is needed to re-size, or select the bio max size.
>>
>> At the very least, I think that such limit should not be arbitrary as your patch
>> proposes but rely on the device characteristics (e.g.
>> max_hw_sectors_kb/max_sectors_kb and queue depth).
>>
> 
> I agree with your opinion, I thought same as your idea. For that, deep research
> is needed, proper timing to set and bio structure modification, etc ...

Why would you need any BIO structure modifications? Your patch is on the right
track if limiting the BIO size is the right solution (I am still not completely
convinced). E.g., the code:

	if (page_is_mergeable(bv, page, len, off, same_page)) {
		if (bio->bi_iter.bi_size > BIO_MAX_SIZE - len) {
			*same_page = false;
			return false;
		}

could just become:

	if (page_is_mergeable(bv, page, len, off, same_page)) {
		if (bio->bi_iter.bi_size > bio_max_size(bio) - len) {
			*same_page = false;
			return false;
		}

With bio_max_size() being something like:

static inline size_t bio_max_size(struct bio *bio)
{
	sector_t max_sectors = blk_queue_get_max_sectors(bio->bi_disk->queue,
						         bio_op(bio));

	return max_sectors << SECTOR_SHIFT;
}

Note that this is not super efficient as a BIO maximum size depends on the BIO
offset too (its start sector). So writing something similar to
blk_rq_get_max_sectors() would probably be better.
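
As a rough, untested sketch of that idea (it assumes blk_queue_get_max_sectors()
as it exists today, and that chunk_sectors, when set, is a power of two):

static inline size_t bio_max_size(struct bio *bio)
{
	struct request_queue *q = bio->bi_disk->queue;
	unsigned int max_sectors = blk_queue_get_max_sectors(q, bio_op(bio));
	unsigned int chunk_sectors = q->limits.chunk_sectors;
	sector_t offset = bio->bi_iter.bi_sector;

	/* respect a chunk_sectors boundary, e.g. on zoned or striped devices */
	if (chunk_sectors)
		max_sectors = min(max_sectors,
				  chunk_sectors -
				  (unsigned int)(offset & (chunk_sectors - 1)));

	return (size_t)max_sectors << SECTOR_SHIFT;
}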

> Current is simple patch for default bio max size.
> Before applying of multipage bvec, bio max size was 1MB in kernel 4.x by BIO_MAX_PAGES.
> So I think 1MB bio max size is reasonable as a default.

max_sectors_kb is always defined for any block device so I do not think there is
a need for any arbitrary default value.

Since such an optimization very likely depends on the speed of the system CPU
and of the storage device used, it may be a good idea to make this configurable
through sysfs. That is, bio_max_size() simply returns UINT_MAX, leading to no
change from the current behavior, if the optimization is disabled (the default),
and max_sectors_kb if it is enabled.
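
Purely as an illustration (the QUEUE_FLAG_LIMIT_BIO_SIZE flag name is invented
here, it does not exist yet), the sysfs-controlled version could look like:

static inline size_t bio_max_size(struct bio *bio)
{
	struct request_queue *q = bio->bi_disk->queue;

	/* hypothetical opt-in flag; when clear, keep the current behavior */
	if (!test_bit(QUEUE_FLAG_LIMIT_BIO_SIZE, &q->queue_flags))
		return UINT_MAX;

	return (size_t)queue_max_sectors(q) << SECTOR_SHIFT;
}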

> 
>>>
>>>>
>>>>>
>>>>> Signed-off-by: Changheun Lee <nanich.lee@samsung.com>
>>>>> ---
>>>>>  block/bio.c         | 2 +-
>>>>>  include/linux/bio.h | 3 ++-
>>>>>  2 files changed, 3 insertions(+), 2 deletions(-)
>>>>>
>>>>> diff --git a/block/bio.c b/block/bio.c
>>>>> index 1f2cc1fbe283..dbe14d675f28 100644
>>>>> --- a/block/bio.c
>>>>> +++ b/block/bio.c
>>>>> @@ -877,7 +877,7 @@ bool __bio_try_merge_page(struct bio *bio, struct page *page,
>>>>>  		struct bio_vec *bv = &bio->bi_io_vec[bio->bi_vcnt - 1];
>>>>>  
>>>>>  		if (page_is_mergeable(bv, page, len, off, same_page)) {
>>>>> -			if (bio->bi_iter.bi_size > UINT_MAX - len) {
>>>>> +			if (bio->bi_iter.bi_size > BIO_MAX_SIZE - len) {
>>>>>  				*same_page = false;
>>>>>  				return false;
>>>>>  			}
>>>>> diff --git a/include/linux/bio.h b/include/linux/bio.h
>>>>> index 1edda614f7ce..0f49b354b1f6 100644
>>>>> --- a/include/linux/bio.h
>>>>> +++ b/include/linux/bio.h
>>>>> @@ -20,6 +20,7 @@
>>>>>  #endif
>>>>>  
>>>>>  #define BIO_MAX_PAGES		256
>>>>> +#define BIO_MAX_SIZE		(BIO_MAX_PAGES * PAGE_SIZE)
>>>>>  
>>>>>  #define bio_prio(bio)			(bio)->bi_ioprio
>>>>>  #define bio_set_prio(bio, prio)		((bio)->bi_ioprio = prio)
>>>>> @@ -113,7 +114,7 @@ static inline bool bio_full(struct bio *bio, unsigned len)
>>>>>  	if (bio->bi_vcnt >= bio->bi_max_vecs)
>>>>>  		return true;
>>>>>  
>>>>> -	if (bio->bi_iter.bi_size > UINT_MAX - len)
>>>>> +	if (bio->bi_iter.bi_size > BIO_MAX_SIZE - len)
>>>>>  		return true;
>>>>>  
>>>>>  	return false;
>>>>>
>>>>
>>>>
>>>> -- 
>>>> Damien Le Moal
>>>> Western Digital Research
>>>
>>
>>
>> -- 
>> Damien Le Moal
>> Western Digital Research
>>
> 
> ---
> Changheun Lee
> Samsung Electronics
> 
>
Changheun Lee Jan. 13, 2021, 6:39 a.m. UTC | #6
>On 2021/01/13 13:01, Changheun Lee wrote:
>>> On 2021/01/12 21:14, Changheun Lee wrote:
>>>>> On 2021/01/12 17:52, Changheun Lee wrote:
>>>>>> From: "Changheun Lee" <nanich.lee@samsung.com>
>>>>>>
>>>>>> bio size can grow up to 4GB when muli-page bvec is enabled.
>>>>>> but sometimes it would lead to inefficient behaviors.
>>>>>> in case of large chunk direct I/O, - 64MB chunk read in user space -
>>>>>> all pages for 64MB would be merged to a bio structure if memory address is
>>>>>> continued phsycally. it makes some delay to submit until merge complete.
>>>>>> bio max size should be limited as a proper size.
>>>>>
>>>>> But merging physically contiguous pages into the same bvec + later automatic bio
>>>>> split on submit should give you better throughput for large IOs compared to
>>>>> having to issue a bio chain of smaller BIOs that are arbitrarily sized and will
>>>>> likely need splitting anyway (because of DMA boundaries etc).
>>>>>
>>>>> Do you have a specific case where you see higher performance with this patch
>>>>> applied ? On Intel, BIO_MAX_SIZE would be 1MB... That is arbitrary and too small
>>>>> considering that many hardware can execute larger IOs than that.
>>>>>
>>>>
>>>> When I tested 32MB chunk read with O_DIRECT in android, all pages of 32MB
>>>> is merged into a bio structure.
>>>> And elapsed time to merge complete was about 2ms.
>>>> It means first bio-submit is after 2ms.
>>>> If bio size is limited with 1MB with this patch, first bio-submit is about
>>>> 100us by bio_full operation.
>>>
>>> bio_submit() will split the large BIO case into multiple requests while the
>>> small BIO case will likely result one or two requests only. That likely explain
>>> the time difference here. However, for the large case, the 2ms will issue ALL
>>> requests needed for processing the entire 32MB user IO while the 1MB bio case
>>> will need 32 different bio_submit() calls. So what is the actual total latency
>>> difference for the entire 32MB user IO ? That is I think what needs to be
>>> compared here.
>>>
>>> Also, what is your device max_sectors_kb and max queue depth ?
>>>
>> 
>> 32MB total latency is about 19ms including merge time without this patch.
>> But with this patch, total latency is about 17ms including merge time too.
>> Actually 32MB read time from device is same - about 16.7ms - in driver layer.
>> No need to hold more I/O than max_sectors_kb during bio merge.
>> My device is UFS. and max_sectors_kb is 1MB, queue depth is 32.
>
>This may be due to the CPU being slow: it takes time to build the 32MB BIO,
>during which the device is idling. With the 1MB BIO limit, the first BIO hits
>the drive much more quickly, reducing idle time of the device and leading to
>higher throughput. But there are 31 more BIOs to build and issue after the first
>one... That does create a pipeline of requests keeping the device busy, but that
>also likely keeps your CPU a lot more busy building these additional BIOs.
>Overall, do you see a difference in CPU load for the 32MB BIO case vs the 1MB
>max BIO case ? Any increase in CPU load with the lower BIO size limit ?
>

CPU load is higher than in the 32MB bio size case, because the bio_submit
progress runs in parallel. But I tested at the same CPU operating frequency,
so there is no difference in CPU speed.
Logically, the bio max size is 4GB now. I know that is not realistic, but
merging 4GB into a single bio would take a long time even if the CPU is fast.
This is why I think the bio max size should be limited.

>> 
>>>> It's not large delay and can't be observed with low speed device.
>>>> But it's needed to reduce merge delay for high speed device.
>>>> I improved 512MB sequential read performance from 1900MB/s to 2000MB/s
>>>> with this patch on android platform.
>>>> As you said, 1MB might be small for some device.
>>>> But method is needed to re-size, or select the bio max size.
>>>
>>> At the very least, I think that such limit should not be arbitrary as your patch
>>> proposes but rely on the device characteristics (e.g.
>>> max_hw_sectors_kb/max_sectors_kb and queue depth).
>>>
>> 
>> I agree with your opinion, I thought same as your idea. For that, deep research
>> is needed, proper timing to set and bio structure modification, etc ...
>
>Why would you need any BIO structure modifications ? Your patch is on the right
>track if limiting the BIO size is the right solution (I am not still completely
>convinced). E.g., the code:
>
>	if (page_is_mergeable(bv, page, len, off, same_page)) {
>		if (bio->bi_iter.bi_size > BIO_MAX_SIZE - len) {
>			*same_page = false;
>			return false;
>		}
>
>could just become:
>
>	if (page_is_mergeable(bv, page, len, off, same_page)) {
>		if (bio->bi_iter.bi_size > bio_max_size(bio) - len) {
>			*same_page = false;
>			return false;
>		}
>
>With bio_max_size() being something like:
>
>static inline size_t bio_max_size(struct bio *bio)
>{
>	sector_t max_sectors = blk_queue_get_max_sectors(bio->bi_disk->queue,
>							 bio_op(bio));
>
>	return max_sectors << SECTOR_SHIFT;
>}
>
>Note that this is not super efficient as a BIO maximum size depends on the BIO
>offset too (its start sector). So writing something similar to
>blk_rq_get_max_sectors() would probably be better.

Good suggestion. :)

>
>> Current is simple patch for default bio max size.
>> Before applying of multipage bvec, bio max size was 1MB in kernel 4.x by BIO_MAX_PAGES.
>> So I think 1MB bio max size is reasonable as a default.
>
>max_sectors_kb is always defined for any block device so I do not think there is
>a need for any arbitrary default value.
>
>Since such optimization likely very much depend on the speed of the system CPU
>and of the storage device used, it may be a good idea to have this configurable
>through sysfs. That is, bio_max_size() simply returns UINT_MAX leading to no
>change from the current behavior if the optimization is disabled (default) and
>max_sectors_kb if it is enabled.
>

OK, I agree with you. That will be best for everyone for now.
I'll try to implement it.

>> 
>>>>
>>>>>
>>>>>>
>>>>>> Signed-off-by: Changheun Lee <nanich.lee@samsung.com>
>>>>>> ---
>>>>>>  block/bio.c         | 2 +-
>>>>>>  include/linux/bio.h | 3 ++-
>>>>>>  2 files changed, 3 insertions(+), 2 deletions(-)
>>>>>>
>>>>>> diff --git a/block/bio.c b/block/bio.c
>>>>>> index 1f2cc1fbe283..dbe14d675f28 100644
>>>>>> --- a/block/bio.c
>>>>>> +++ b/block/bio.c
>>>>>> @@ -877,7 +877,7 @@ bool __bio_try_merge_page(struct bio *bio, struct page *page,
>>>>>>  		struct bio_vec *bv = &bio->bi_io_vec[bio->bi_vcnt - 1];
>>>>>>  
>>>>>>  		if (page_is_mergeable(bv, page, len, off, same_page)) {
>>>>>> -			if (bio->bi_iter.bi_size > UINT_MAX - len) {
>>>>>> +			if (bio->bi_iter.bi_size > BIO_MAX_SIZE - len) {
>>>>>>  				*same_page = false;
>>>>>>  				return false;
>>>>>>  			}
>>>>>> diff --git a/include/linux/bio.h b/include/linux/bio.h
>>>>>> index 1edda614f7ce..0f49b354b1f6 100644
>>>>>> --- a/include/linux/bio.h
>>>>>> +++ b/include/linux/bio.h
>>>>>> @@ -20,6 +20,7 @@
>>>>>>  #endif
>>>>>>  
>>>>>>  #define BIO_MAX_PAGES		256
>>>>>> +#define BIO_MAX_SIZE		(BIO_MAX_PAGES * PAGE_SIZE)
>>>>>>  
>>>>>>  #define bio_prio(bio)			(bio)->bi_ioprio
>>>>>>  #define bio_set_prio(bio, prio)		((bio)->bi_ioprio = prio)
>>>>>> @@ -113,7 +114,7 @@ static inline bool bio_full(struct bio *bio, unsigned len)
>>>>>>  	if (bio->bi_vcnt >= bio->bi_max_vecs)
>>>>>>  		return true;
>>>>>>  
>>>>>> -	if (bio->bi_iter.bi_size > UINT_MAX - len)
>>>>>> +	if (bio->bi_iter.bi_size > BIO_MAX_SIZE - len)
>>>>>>  		return true;
>>>>>>  
>>>>>>  	return false;
>>>>>>
>>>>>
>>>>>
>>>>> -- 
>>>>> Damien Le Moal
>>>>> Western Digital Research
>>>>
>>>
>>>
>>> -- 
>>> Damien Le Moal
>>> Western Digital Research
>>>
>> 
>> ---
>> Changheun Lee
>> Samsung Electronics
>> 
>> 
>
>
>-- 
>Damien Le Moal
>Western Digital Research
>

---
Changheun Lee
Samsung Electronics
Damien Le Moal Jan. 13, 2021, 7:12 a.m. UTC | #7
On 2021/01/13 15:54, Changheun Lee wrote:
>> On 2021/01/13 13:01, Changheun Lee wrote:
>>>> On 2021/01/12 21:14, Changheun Lee wrote:
>>>>>> On 2021/01/12 17:52, Changheun Lee wrote:
>>>>>>> From: "Changheun Lee" <nanich.lee@samsung.com>
>>>>>>>
>>>>>>> bio size can grow up to 4GB when muli-page bvec is enabled.
>>>>>>> but sometimes it would lead to inefficient behaviors.
>>>>>>> in case of large chunk direct I/O, - 64MB chunk read in user space -
>>>>>>> all pages for 64MB would be merged to a bio structure if memory address is
>>>>>>> continued phsycally. it makes some delay to submit until merge complete.
>>>>>>> bio max size should be limited as a proper size.
>>>>>>
>>>>>> But merging physically contiguous pages into the same bvec + later automatic bio
>>>>>> split on submit should give you better throughput for large IOs compared to
>>>>>> having to issue a bio chain of smaller BIOs that are arbitrarily sized and will
>>>>>> likely need splitting anyway (because of DMA boundaries etc).
>>>>>>
>>>>>> Do you have a specific case where you see higher performance with this patch
>>>>>> applied ? On Intel, BIO_MAX_SIZE would be 1MB... That is arbitrary and too small
>>>>>> considering that many hardware can execute larger IOs than that.
>>>>>>
>>>>>
>>>>> When I tested 32MB chunk read with O_DIRECT in android, all pages of 32MB
>>>>> is merged into a bio structure.
>>>>> And elapsed time to merge complete was about 2ms.
>>>>> It means first bio-submit is after 2ms.
>>>>> If bio size is limited with 1MB with this patch, first bio-submit is about
>>>>> 100us by bio_full operation.
>>>>
>>>> bio_submit() will split the large BIO case into multiple requests while the
>>>> small BIO case will likely result one or two requests only. That likely explain
>>>> the time difference here. However, for the large case, the 2ms will issue ALL
>>>> requests needed for processing the entire 32MB user IO while the 1MB bio case
>>>> will need 32 different bio_submit() calls. So what is the actual total latency
>>>> difference for the entire 32MB user IO ? That is I think what needs to be
>>>> compared here.
>>>>
>>>> Also, what is your device max_sectors_kb and max queue depth ?
>>>>
>>>
>>> 32MB total latency is about 19ms including merge time without this patch.
>>> But with this patch, total latency is about 17ms including merge time too.
>>> Actually 32MB read time from device is same - about 16.7ms - in driver layer.
>>> No need to hold more I/O than max_sectors_kb during bio merge.
>>> My device is UFS. and max_sectors_kb is 1MB, queue depth is 32.
>>
>> This may be due to the CPU being slow: it takes time to build the 32MB BIO,
>> during which the device is idling. With the 1MB BIO limit, the first BIO hits
>> the drive much more quickly, reducing idle time of the device and leading to
>> higher throughput. But there are 31 more BIOs to build and issue after the first
>> one... That does create a pipeline of requests keeping the device busy, but that
>> also likely keeps your CPU a lot more busy building these additional BIOs.
>> Overall, do you see a difference in CPU load for the 32MB BIO case vs the 1MB
>> max BIO case ? Any increase in CPU load with the lower BIO size limit ?
>>
> 
> CPU load is more than 32MB bio size. Because bio_sumbit progress is doing in parallel.
> But I tested it same CPU operation frequency, So there are no difference of CPU speed.
> Logically, bio max size is 4GB now. it's not realistic I know, but 4GB merge to a bio
> will takes much time even if CPU works fast.
> This is why I think bio max size should be limited.

I do not think that the page merging code is the real root cause here. That is
fast and does not depend on the current size of the BIO. This is essentially an
O(1) operation. The root cause of your performance drop is most likely the fact
that your device is kept idle while the large BIO is being built, adding the
8192 pages of the large 32MB user IO. Building that large BIO is a lot more
efficient, CPU-wise, than building and issuing a lot of small BIOs. That gives
a lot of benefit on high-end desktops and servers with fast CPUs, but is
counterproductive in your case with a slower CPU.

I wonder: what is the user IO size at which you start seeing a performance drop
without the patch? It is clear that limiting the BIO size does improve things
for the 32MB IO size you tested, but what about more realistic workloads with
128K or so IO sizes (the typical IO size for an FS using the page cache)?

> 
>>>
>>>>> It's not large delay and can't be observed with low speed device.
>>>>> But it's needed to reduce merge delay for high speed device.
>>>>> I improved 512MB sequential read performance from 1900MB/s to 2000MB/s
>>>>> with this patch on android platform.
>>>>> As you said, 1MB might be small for some device.
>>>>> But method is needed to re-size, or select the bio max size.
>>>>
>>>> At the very least, I think that such limit should not be arbitrary as your patch
>>>> proposes but rely on the device characteristics (e.g.
>>>> max_hw_sectors_kb/max_sectors_kb and queue depth).
>>>>
>>>
>>> I agree with your opinion, I thought same as your idea. For that, deep research
>>> is needed, proper timing to set and bio structure modification, etc ...
>>
>> Why would you need any BIO structure modifications ? Your patch is on the right
>> track if limiting the BIO size is the right solution (I am not still completely
>> convinced). E.g., the code:
>>
>> 	if (page_is_mergeable(bv, page, len, off, same_page)) {
>> 		if (bio->bi_iter.bi_size > BIO_MAX_SIZE - len) {
>> 			*same_page = false;
>> 			return false;
>> 		}
>>
>> could just become:
>>
>> 	if (page_is_mergeable(bv, page, len, off, same_page)) {
>> 		if (bio->bi_iter.bi_size > bio_max_size(bio) - len) {
>> 			*same_page = false;
>> 			return false;
>> 		}
>>
>> With bio_max_size() being something like:
>>
>> static inline size_t bio_max_size(struct bio *bio)
>> {
>> 	sector_t max_sectors = blk_queue_get_max_sectors(bio->bi_disk->queue,
>> 							 bio_op(bio));
>>
>> 	return max_sectors << SECTOR_SHIFT;
>> }
>>
>> Note that this is not super efficient as a BIO maximum size depends on the BIO
>> offset too (its start sector). So writing something similar to
>> blk_rq_get_max_sectors() would probably be better.
> 
> Good suggestion. :)
> 
>>
>>> Current is simple patch for default bio max size.
>>> Before applying of multipage bvec, bio max size was 1MB in kernel 4.x by BIO_MAX_PAGES.
>>> So I think 1MB bio max size is reasonable as a default.
>>
>> max_sectors_kb is always defined for any block device so I do not think there is
>> a need for any arbitrary default value.
>>
>> Since such optimization likely very much depend on the speed of the system CPU
>> and of the storage device used, it may be a good idea to have this configurable
>> through sysfs. That is, bio_max_size() simply returns UINT_MAX leading to no
>> change from the current behavior if the optimization is disabled (default) and
>> max_sectors_kb if it is enabled.
>>
> 
> OK, I agree with you. It will be best for all now.
> I'll try to make this.
> 
>>>
>>>>>
>>>>>>
>>>>>>>
>>>>>>> Signed-off-by: Changheun Lee <nanich.lee@samsung.com>
>>>>>>> ---
>>>>>>>  block/bio.c         | 2 +-
>>>>>>>  include/linux/bio.h | 3 ++-
>>>>>>>  2 files changed, 3 insertions(+), 2 deletions(-)
>>>>>>>
>>>>>>> diff --git a/block/bio.c b/block/bio.c
>>>>>>> index 1f2cc1fbe283..dbe14d675f28 100644
>>>>>>> --- a/block/bio.c
>>>>>>> +++ b/block/bio.c
>>>>>>> @@ -877,7 +877,7 @@ bool __bio_try_merge_page(struct bio *bio, struct page *page,
>>>>>>>  		struct bio_vec *bv = &bio->bi_io_vec[bio->bi_vcnt - 1];
>>>>>>>  
>>>>>>>  		if (page_is_mergeable(bv, page, len, off, same_page)) {
>>>>>>> -			if (bio->bi_iter.bi_size > UINT_MAX - len) {
>>>>>>> +			if (bio->bi_iter.bi_size > BIO_MAX_SIZE - len) {
>>>>>>>  				*same_page = false;
>>>>>>>  				return false;
>>>>>>>  			}
>>>>>>> diff --git a/include/linux/bio.h b/include/linux/bio.h
>>>>>>> index 1edda614f7ce..0f49b354b1f6 100644
>>>>>>> --- a/include/linux/bio.h
>>>>>>> +++ b/include/linux/bio.h
>>>>>>> @@ -20,6 +20,7 @@
>>>>>>>  #endif
>>>>>>>  
>>>>>>>  #define BIO_MAX_PAGES		256
>>>>>>> +#define BIO_MAX_SIZE		(BIO_MAX_PAGES * PAGE_SIZE)
>>>>>>>  
>>>>>>>  #define bio_prio(bio)			(bio)->bi_ioprio
>>>>>>>  #define bio_set_prio(bio, prio)		((bio)->bi_ioprio = prio)
>>>>>>> @@ -113,7 +114,7 @@ static inline bool bio_full(struct bio *bio, unsigned len)
>>>>>>>  	if (bio->bi_vcnt >= bio->bi_max_vecs)
>>>>>>>  		return true;
>>>>>>>  
>>>>>>> -	if (bio->bi_iter.bi_size > UINT_MAX - len)
>>>>>>> +	if (bio->bi_iter.bi_size > BIO_MAX_SIZE - len)
>>>>>>>  		return true;
>>>>>>>  
>>>>>>>  	return false;
>>>>>>>
>>>>>>
>>>>>>
>>>>>> -- 
>>>>>> Damien Le Moal
>>>>>> Western Digital Research
>>>>>
>>>>
>>>>
>>>> -- 
>>>> Damien Le Moal
>>>> Western Digital Research
>>>>
>>>
>>> ---
>>> Changheun Lee
>>> Samsung Electronics
>>>
>>>
>>
>>
>> -- 
>> Damien Le Moal
>> Western Digital Research
>>
> 
> ---
> Changheun Lee
> Samsung Electronics
> 
>
Ming Lei Jan. 13, 2021, 9:19 a.m. UTC | #8
On Wed, Jan 13, 2021 at 12:09 PM Changheun Lee <nanich.lee@samsung.com> wrote:
>
> >On 2021/01/12 21:14, Changheun Lee wrote:
> >>> On 2021/01/12 17:52, Changheun Lee wrote:
> >>>> From: "Changheun Lee" <nanich.lee@samsung.com>
> >>>>
> >>>> bio size can grow up to 4GB when muli-page bvec is enabled.
> >>>> but sometimes it would lead to inefficient behaviors.
> >>>> in case of large chunk direct I/O, - 64MB chunk read in user space -
> >>>> all pages for 64MB would be merged to a bio structure if memory address is
> >>>> continued phsycally. it makes some delay to submit until merge complete.
> >>>> bio max size should be limited as a proper size.
> >>>
> >>> But merging physically contiguous pages into the same bvec + later automatic bio
> >>> split on submit should give you better throughput for large IOs compared to
> >>> having to issue a bio chain of smaller BIOs that are arbitrarily sized and will
> >>> likely need splitting anyway (because of DMA boundaries etc).
> >>>
> >>> Do you have a specific case where you see higher performance with this patch
> >>> applied ? On Intel, BIO_MAX_SIZE would be 1MB... That is arbitrary and too small
> >>> considering that many hardware can execute larger IOs than that.
> >>>
> >>
> >> When I tested 32MB chunk read with O_DIRECT in android, all pages of 32MB
> >> is merged into a bio structure.
> >> And elapsed time to merge complete was about 2ms.
> >> It means first bio-submit is after 2ms.
> >> If bio size is limited with 1MB with this patch, first bio-submit is about
> >> 100us by bio_full operation.
> >
> >bio_submit() will split the large BIO case into multiple requests while the
> >small BIO case will likely result one or two requests only. That likely explain
> >the time difference here. However, for the large case, the 2ms will issue ALL
> >requests needed for processing the entire 32MB user IO while the 1MB bio case
> >will need 32 different bio_submit() calls. So what is the actual total latency
> >difference for the entire 32MB user IO ? That is I think what needs to be
> >compared here.
> >
> >Also, what is your device max_sectors_kb and max queue depth ?
> >
>
> 32MB total latency is about 19ms including merge time without this patch.
> But with this patch, total latency is about 17ms including merge time too.

19ms looks too big just for preparing one 32MB sized bio, which isn't supposed
to take so long. Can you investigate where the 19ms goes just for preparing one
32MB sized bio?

It might be iov_iter_get_pages() handling page faults. If so, one suggestion is
to enable THP (Transparent HugePage Support) in your application.
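
For example, a minimal userspace sketch (not from this thread) of hinting THP
for the O_DIRECT read buffer could be:

#define _GNU_SOURCE
#include <stdlib.h>
#include <sys/mman.h>

#define CHUNK_SIZE	(32UL << 20)	/* 32MB read buffer */

static void *alloc_thp_buffer(void)
{
	void *buf;

	/* align to 2MB so huge pages can back the whole range */
	if (posix_memalign(&buf, 2UL << 20, CHUNK_SIZE))
		return NULL;

	/* advisory only: the kernel may or may not use huge pages */
	madvise(buf, CHUNK_SIZE, MADV_HUGEPAGE);

	return buf;
}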
Damien Le Moal Jan. 13, 2021, 9:28 a.m. UTC | #9
On 2021/01/13 18:19, Ming Lei wrote:
> On Wed, Jan 13, 2021 at 12:09 PM Changheun Lee <nanich.lee@samsung.com> wrote:
>>
>>> On 2021/01/12 21:14, Changheun Lee wrote:
>>>>> On 2021/01/12 17:52, Changheun Lee wrote:
>>>>>> From: "Changheun Lee" <nanich.lee@samsung.com>
>>>>>>
>>>>>> bio size can grow up to 4GB when muli-page bvec is enabled.
>>>>>> but sometimes it would lead to inefficient behaviors.
>>>>>> in case of large chunk direct I/O, - 64MB chunk read in user space -
>>>>>> all pages for 64MB would be merged to a bio structure if memory address is
>>>>>> continued phsycally. it makes some delay to submit until merge complete.
>>>>>> bio max size should be limited as a proper size.
>>>>>
>>>>> But merging physically contiguous pages into the same bvec + later automatic bio
>>>>> split on submit should give you better throughput for large IOs compared to
>>>>> having to issue a bio chain of smaller BIOs that are arbitrarily sized and will
>>>>> likely need splitting anyway (because of DMA boundaries etc).
>>>>>
>>>>> Do you have a specific case where you see higher performance with this patch
>>>>> applied ? On Intel, BIO_MAX_SIZE would be 1MB... That is arbitrary and too small
>>>>> considering that many hardware can execute larger IOs than that.
>>>>>
>>>>
>>>> When I tested 32MB chunk read with O_DIRECT in android, all pages of 32MB
>>>> is merged into a bio structure.
>>>> And elapsed time to merge complete was about 2ms.
>>>> It means first bio-submit is after 2ms.
>>>> If bio size is limited with 1MB with this patch, first bio-submit is about
>>>> 100us by bio_full operation.
>>>
>>> bio_submit() will split the large BIO case into multiple requests while the
>>> small BIO case will likely result one or two requests only. That likely explain
>>> the time difference here. However, for the large case, the 2ms will issue ALL
>>> requests needed for processing the entire 32MB user IO while the 1MB bio case
>>> will need 32 different bio_submit() calls. So what is the actual total latency
>>> difference for the entire 32MB user IO ? That is I think what needs to be
>>> compared here.
>>>
>>> Also, what is your device max_sectors_kb and max queue depth ?
>>>
>>
>> 32MB total latency is about 19ms including merge time without this patch.
>> But with this patch, total latency is about 17ms including merge time too.
> 
> 19ms looks too big just for preparing one 32MB sized bio, which isn't
> supposed to
> take so long.  Can you investigate where the 19ms is taken just for
> preparing one
> 32MB sized bio?

Changheun mentioned that the device side IO latency is 16.7ms out of the 19ms
total. So the BIO handling, submission+completion takes about 2.3ms, and
Changheun points above to 2ms for the submission part.

> 
> It might be iov_iter_get_pages() for handling page fault. If yes, one suggestion
> is to enable THP(Transparent HugePage Support) in your application.

But if that was due to page faults, the same large-ish time would be taken for
preparing the size-limited BIOs too, no? No matter how the BIOs are diced,
all 32MB of pages of the user IO are referenced...

> 
>
Ming Lei Jan. 13, 2021, 10:24 a.m. UTC | #10
On Wed, Jan 13, 2021 at 09:28:02AM +0000, Damien Le Moal wrote:
> On 2021/01/13 18:19, Ming Lei wrote:
> > On Wed, Jan 13, 2021 at 12:09 PM Changheun Lee <nanich.lee@samsung.com> wrote:
> >>
> >>> On 2021/01/12 21:14, Changheun Lee wrote:
> >>>>> On 2021/01/12 17:52, Changheun Lee wrote:
> >>>>>> From: "Changheun Lee" <nanich.lee@samsung.com>
> >>>>>>
> >>>>>> bio size can grow up to 4GB when muli-page bvec is enabled.
> >>>>>> but sometimes it would lead to inefficient behaviors.
> >>>>>> in case of large chunk direct I/O, - 64MB chunk read in user space -
> >>>>>> all pages for 64MB would be merged to a bio structure if memory address is
> >>>>>> continued phsycally. it makes some delay to submit until merge complete.
> >>>>>> bio max size should be limited as a proper size.
> >>>>>
> >>>>> But merging physically contiguous pages into the same bvec + later automatic bio
> >>>>> split on submit should give you better throughput for large IOs compared to
> >>>>> having to issue a bio chain of smaller BIOs that are arbitrarily sized and will
> >>>>> likely need splitting anyway (because of DMA boundaries etc).
> >>>>>
> >>>>> Do you have a specific case where you see higher performance with this patch
> >>>>> applied ? On Intel, BIO_MAX_SIZE would be 1MB... That is arbitrary and too small
> >>>>> considering that many hardware can execute larger IOs than that.
> >>>>>
> >>>>
> >>>> When I tested 32MB chunk read with O_DIRECT in android, all pages of 32MB
> >>>> is merged into a bio structure.
> >>>> And elapsed time to merge complete was about 2ms.
> >>>> It means first bio-submit is after 2ms.
> >>>> If bio size is limited with 1MB with this patch, first bio-submit is about
> >>>> 100us by bio_full operation.
> >>>
> >>> bio_submit() will split the large BIO case into multiple requests while the
> >>> small BIO case will likely result one or two requests only. That likely explain
> >>> the time difference here. However, for the large case, the 2ms will issue ALL
> >>> requests needed for processing the entire 32MB user IO while the 1MB bio case
> >>> will need 32 different bio_submit() calls. So what is the actual total latency
> >>> difference for the entire 32MB user IO ? That is I think what needs to be
> >>> compared here.
> >>>
> >>> Also, what is your device max_sectors_kb and max queue depth ?
> >>>
> >>
> >> 32MB total latency is about 19ms including merge time without this patch.
> >> But with this patch, total latency is about 17ms including merge time too.
> > 
> > 19ms looks too big just for preparing one 32MB sized bio, which isn't
> > supposed to
> > take so long.  Can you investigate where the 19ms is taken just for
> > preparing one
> > 32MB sized bio?
> 
> Changheun mentioned that the device side IO latency is 16.7ms out of the 19ms
> total. So the BIO handling, submission+completion takes about 2.3ms, and
> Changheun points above to 2ms for the submission part.

OK, looks like I misunderstood the data.

> 
> > 
> > It might be iov_iter_get_pages() for handling page fault. If yes, one suggestion
> > is to enable THP(Transparent HugePage Support) in your application.
> 
> But if that was due to page faults, the same large-ish time would be taken for
> the preparing the size-limited BIOs too, no ? No matter how the BIOs are diced,
> all 32MB of pages of the user IO are referenced...

If the bio size is reduced to 1MB, just 256 pages need to be faulted in before
submitting this bio, instead of 256*32 pages, which is why the following was
mentioned:

	It means first bio-submit is after 2ms.
	If bio size is limited with 1MB with this patch, first bio-submit is about
	100us by bio_full operation.
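
(That is roughly consistent with the reported numbers: 2ms over 256*32 = 8192
pages is about 0.25us per page, and 256 pages at that rate is on the order of
60us, the same ballpark as the observed ~100us.)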


Thanks,
Ming
Damien Le Moal Jan. 13, 2021, 11:16 a.m. UTC | #11
On 2021/01/13 19:25, Ming Lei wrote:
> On Wed, Jan 13, 2021 at 09:28:02AM +0000, Damien Le Moal wrote:
>> On 2021/01/13 18:19, Ming Lei wrote:
>>> On Wed, Jan 13, 2021 at 12:09 PM Changheun Lee <nanich.lee@samsung.com> wrote:
>>>>
>>>>> On 2021/01/12 21:14, Changheun Lee wrote:
>>>>>>> On 2021/01/12 17:52, Changheun Lee wrote:
>>>>>>>> From: "Changheun Lee" <nanich.lee@samsung.com>
>>>>>>>>
>>>>>>>> bio size can grow up to 4GB when muli-page bvec is enabled.
>>>>>>>> but sometimes it would lead to inefficient behaviors.
>>>>>>>> in case of large chunk direct I/O, - 64MB chunk read in user space -
>>>>>>>> all pages for 64MB would be merged to a bio structure if memory address is
>>>>>>>> continued phsycally. it makes some delay to submit until merge complete.
>>>>>>>> bio max size should be limited as a proper size.
>>>>>>>
>>>>>>> But merging physically contiguous pages into the same bvec + later automatic bio
>>>>>>> split on submit should give you better throughput for large IOs compared to
>>>>>>> having to issue a bio chain of smaller BIOs that are arbitrarily sized and will
>>>>>>> likely need splitting anyway (because of DMA boundaries etc).
>>>>>>>
>>>>>>> Do you have a specific case where you see higher performance with this patch
>>>>>>> applied ? On Intel, BIO_MAX_SIZE would be 1MB... That is arbitrary and too small
>>>>>>> considering that many hardware can execute larger IOs than that.
>>>>>>>
>>>>>>
>>>>>> When I tested 32MB chunk read with O_DIRECT in android, all pages of 32MB
>>>>>> is merged into a bio structure.
>>>>>> And elapsed time to merge complete was about 2ms.
>>>>>> It means first bio-submit is after 2ms.
>>>>>> If bio size is limited with 1MB with this patch, first bio-submit is about
>>>>>> 100us by bio_full operation.
>>>>>
>>>>> bio_submit() will split the large BIO case into multiple requests while the
>>>>> small BIO case will likely result one or two requests only. That likely explain
>>>>> the time difference here. However, for the large case, the 2ms will issue ALL
>>>>> requests needed for processing the entire 32MB user IO while the 1MB bio case
>>>>> will need 32 different bio_submit() calls. So what is the actual total latency
>>>>> difference for the entire 32MB user IO ? That is I think what needs to be
>>>>> compared here.
>>>>>
>>>>> Also, what is your device max_sectors_kb and max queue depth ?
>>>>>
>>>>
>>>> 32MB total latency is about 19ms including merge time without this patch.
>>>> But with this patch, total latency is about 17ms including merge time too.
>>>
>>> 19ms looks too big just for preparing one 32MB sized bio, which isn't
>>> supposed to
>>> take so long.  Can you investigate where the 19ms is taken just for
>>> preparing one
>>> 32MB sized bio?
>>
>> Changheun mentioned that the device side IO latency is 16.7ms out of the 19ms
>> total. So the BIO handling, submission+completion takes about 2.3ms, and
>> Changheun points above to 2ms for the submission part.
> 
> OK, looks I misunderstood the data.
> 
>>
>>>
>>> It might be iov_iter_get_pages() for handling page fault. If yes, one suggestion
>>> is to enable THP(Transparent HugePage Support) in your application.
>>
>> But if that was due to page faults, the same large-ish time would be taken for
>> the preparing the size-limited BIOs too, no ? No matter how the BIOs are diced,
>> all 32MB of pages of the user IO are referenced...
> 
> If bio size is reduced to 1MB, just 256 pages need to be faulted before submitting this
> bio, instead of 256*32 pages, that is why the following words are mentioned:
> 
> 	It means first bio-submit is after 2ms.
> 	If bio size is limited with 1MB with this patch, first bio-submit is about
> 	100us by bio_full operation.

Yes, but eventually, all pages for the 32MB IO will be faulted in, just not in
one go. The overall number of page faults is likely the same as with the large
BIO preparation. So I think we are back to my previous point, that is, reducing
the device idle time by starting a BIO more quickly, even a small one, leads to
overlap between the CPU time needed for the next BIO preparation and the
previous BIO execution, reducing the overall latency for the entire 32MB user
IO. I don't think that the reason is page faulting in itself.

> 
> 
> Thanks,
> Ming
> 
>
Ming Lei Jan. 13, 2021, 11:47 a.m. UTC | #12
On Wed, Jan 13, 2021 at 11:16:11AM +0000, Damien Le Moal wrote:
> On 2021/01/13 19:25, Ming Lei wrote:
> > On Wed, Jan 13, 2021 at 09:28:02AM +0000, Damien Le Moal wrote:
> >> On 2021/01/13 18:19, Ming Lei wrote:
> >>> On Wed, Jan 13, 2021 at 12:09 PM Changheun Lee <nanich.lee@samsung.com> wrote:
> >>>>
> >>>>> On 2021/01/12 21:14, Changheun Lee wrote:
> >>>>>>> On 2021/01/12 17:52, Changheun Lee wrote:
> >>>>>>>> From: "Changheun Lee" <nanich.lee@samsung.com>
> >>>>>>>>
> >>>>>>>> bio size can grow up to 4GB when muli-page bvec is enabled.
> >>>>>>>> but sometimes it would lead to inefficient behaviors.
> >>>>>>>> in case of large chunk direct I/O, - 64MB chunk read in user space -
> >>>>>>>> all pages for 64MB would be merged to a bio structure if memory address is
> >>>>>>>> continued phsycally. it makes some delay to submit until merge complete.
> >>>>>>>> bio max size should be limited as a proper size.
> >>>>>>>
> >>>>>>> But merging physically contiguous pages into the same bvec + later automatic bio
> >>>>>>> split on submit should give you better throughput for large IOs compared to
> >>>>>>> having to issue a bio chain of smaller BIOs that are arbitrarily sized and will
> >>>>>>> likely need splitting anyway (because of DMA boundaries etc).
> >>>>>>>
> >>>>>>> Do you have a specific case where you see higher performance with this patch
> >>>>>>> applied ? On Intel, BIO_MAX_SIZE would be 1MB... That is arbitrary and too small
> >>>>>>> considering that many hardware can execute larger IOs than that.
> >>>>>>>
> >>>>>>
> >>>>>> When I tested 32MB chunk read with O_DIRECT in android, all pages of 32MB
> >>>>>> is merged into a bio structure.
> >>>>>> And elapsed time to merge complete was about 2ms.
> >>>>>> It means first bio-submit is after 2ms.
> >>>>>> If bio size is limited with 1MB with this patch, first bio-submit is about
> >>>>>> 100us by bio_full operation.
> >>>>>
> >>>>> bio_submit() will split the large BIO case into multiple requests while the
> >>>>> small BIO case will likely result one or two requests only. That likely explain
> >>>>> the time difference here. However, for the large case, the 2ms will issue ALL
> >>>>> requests needed for processing the entire 32MB user IO while the 1MB bio case
> >>>>> will need 32 different bio_submit() calls. So what is the actual total latency
> >>>>> difference for the entire 32MB user IO ? That is I think what needs to be
> >>>>> compared here.
> >>>>>
> >>>>> Also, what is your device max_sectors_kb and max queue depth ?
> >>>>>
> >>>>
> >>>> 32MB total latency is about 19ms including merge time without this patch.
> >>>> But with this patch, total latency is about 17ms including merge time too.
> >>>
> >>> 19ms looks too big just for preparing one 32MB sized bio, which isn't
> >>> supposed to
> >>> take so long.  Can you investigate where the 19ms is taken just for
> >>> preparing one
> >>> 32MB sized bio?
> >>
> >> Changheun mentioned that the device side IO latency is 16.7ms out of the 19ms
> >> total. So the BIO handling, submission+completion takes about 2.3ms, and
> >> Changheun points above to 2ms for the submission part.
> > 
> > OK, looks I misunderstood the data.
> > 
> >>
> >>>
> >>> It might be iov_iter_get_pages() for handling page fault. If yes, one suggestion
> >>> is to enable THP(Transparent HugePage Support) in your application.
> >>
> >> But if that was due to page faults, the same large-ish time would be taken for
> >> the preparing the size-limited BIOs too, no ? No matter how the BIOs are diced,
> >> all 32MB of pages of the user IO are referenced...
> > 
> > If bio size is reduced to 1MB, just 256 pages need to be faulted before submitting this
> > bio, instead of 256*32 pages, that is why the following words are mentioned:
> > 
> > 	It means first bio-submit is after 2ms.
> > 	If bio size is limited with 1MB with this patch, first bio-submit is about
> > 	100us by bio_full operation.
> 
> Yes, but eventually, all pages for the 32MB IO will be faulted in, just not in
> one go. Overall number of page faults is likely the same as with the large BIO
> preparation. So I think we are back to my previous point, that is, reducing the
> device idle time by starting a BIO more quickly, even a small one, leads to
> overlap between CPU time needed for the next BIO preparation and previous BIO
> execution, reducing overall the latency for the entire 32MB user IO.

When bio size is reduced from 32M to 1M:

1MB/(P(1M) + D(1M)) may become bigger than 32MB/(P(32M) + D(32M)), so
throughput is improved.

P(x) means time for preparing 'x' sized IO
D(x) means time for device to handle 'x' sized IO 

It depends on both the CPU and the UFS drive.

> I don't think that the reason is page faulting in itself.

What I meant is that page faulting might account for most of the 100us
(preparing 1MB of data) and the 2ms (preparing 32MB of data). It could be
something else, but it should be easy to figure out.
Damien Le Moal Jan. 13, 2021, 12:02 p.m. UTC | #13
On 2021/01/13 20:48, Ming Lei wrote:
> On Wed, Jan 13, 2021 at 11:16:11AM +0000, Damien Le Moal wrote:
>> On 2021/01/13 19:25, Ming Lei wrote:
>>> On Wed, Jan 13, 2021 at 09:28:02AM +0000, Damien Le Moal wrote:
>>>> On 2021/01/13 18:19, Ming Lei wrote:
>>>>> On Wed, Jan 13, 2021 at 12:09 PM Changheun Lee <nanich.lee@samsung.com> wrote:
>>>>>>
>>>>>>> On 2021/01/12 21:14, Changheun Lee wrote:
>>>>>>>>> On 2021/01/12 17:52, Changheun Lee wrote:
>>>>>>>>>> From: "Changheun Lee" <nanich.lee@samsung.com>
>>>>>>>>>>
>>>>>>>>>> bio size can grow up to 4GB when muli-page bvec is enabled.
>>>>>>>>>> but sometimes it would lead to inefficient behaviors.
>>>>>>>>>> in case of large chunk direct I/O, - 64MB chunk read in user space -
>>>>>>>>>> all pages for 64MB would be merged to a bio structure if memory address is
>>>>>>>>>> continued phsycally. it makes some delay to submit until merge complete.
>>>>>>>>>> bio max size should be limited as a proper size.
>>>>>>>>>
>>>>>>>>> But merging physically contiguous pages into the same bvec + later automatic bio
>>>>>>>>> split on submit should give you better throughput for large IOs compared to
>>>>>>>>> having to issue a bio chain of smaller BIOs that are arbitrarily sized and will
>>>>>>>>> likely need splitting anyway (because of DMA boundaries etc).
>>>>>>>>>
>>>>>>>>> Do you have a specific case where you see higher performance with this patch
>>>>>>>>> applied ? On Intel, BIO_MAX_SIZE would be 1MB... That is arbitrary and too small
>>>>>>>>> considering that many hardware can execute larger IOs than that.
>>>>>>>>>
>>>>>>>>
>>>>>>>> When I tested 32MB chunk read with O_DIRECT in android, all pages of 32MB
>>>>>>>> is merged into a bio structure.
>>>>>>>> And elapsed time to merge complete was about 2ms.
>>>>>>>> It means first bio-submit is after 2ms.
>>>>>>>> If bio size is limited with 1MB with this patch, first bio-submit is about
>>>>>>>> 100us by bio_full operation.
>>>>>>>
>>>>>>> bio_submit() will split the large BIO case into multiple requests while the
>>>>>>> small BIO case will likely result one or two requests only. That likely explain
>>>>>>> the time difference here. However, for the large case, the 2ms will issue ALL
>>>>>>> requests needed for processing the entire 32MB user IO while the 1MB bio case
>>>>>>> will need 32 different bio_submit() calls. So what is the actual total latency
>>>>>>> difference for the entire 32MB user IO ? That is I think what needs to be
>>>>>>> compared here.
>>>>>>>
>>>>>>> Also, what is your device max_sectors_kb and max queue depth ?
>>>>>>>
>>>>>>
>>>>>> 32MB total latency is about 19ms including merge time without this patch.
>>>>>> But with this patch, total latency is about 17ms including merge time too.
>>>>>
>>>>> 19ms looks too big just for preparing one 32MB sized bio, which isn't
>>>>> supposed to
>>>>> take so long.  Can you investigate where the 19ms is taken just for
>>>>> preparing one
>>>>> 32MB sized bio?
>>>>
>>>> Changheun mentioned that the device side IO latency is 16.7ms out of the 19ms
>>>> total. So the BIO handling, submission+completion takes about 2.3ms, and
>>>> Changheun points above to 2ms for the submission part.
>>>
>>> OK, looks I misunderstood the data.
>>>
>>>>
>>>>>
>>>>> It might be iov_iter_get_pages() for handling page fault. If yes, one suggestion
>>>>> is to enable THP(Transparent HugePage Support) in your application.
>>>>
>>>> But if that was due to page faults, the same large-ish time would be taken for
>>>> the preparing the size-limited BIOs too, no ? No matter how the BIOs are diced,
>>>> all 32MB of pages of the user IO are referenced...
>>>
>>> If bio size is reduced to 1MB, just 256 pages need to be faulted before submitting this
>>> bio, instead of 256*32 pages, that is why the following words are mentioned:
>>>
>>> 	It means first bio-submit is after 2ms.
>>> 	If bio size is limited with 1MB with this patch, first bio-submit is about
>>> 	100us by bio_full operation.
>>
>> Yes, but eventually, all pages for the 32MB IO will be faulted in, just not in
>> one go. Overall number of page faults is likely the same as with the large BIO
>> preparation. So I think we are back to my previous point, that is, reducing the
>> device idle time by starting a BIO more quickly, even a small one, leads to
>> overlap between CPU time needed for the next BIO preparation and previous BIO
>> execution, reducing overall the latency for the entire 32MB user IO.
> 
> When bio size is reduced from 32M to 1M:
> 
> 1MB/(P(1M) + D(1M)) may become bigger than 32MB/(P(1M) + D(1M)), so
> throughput is improved.

I think that the reason is that P(1M) < D(1M) and so there is overlap between P
and D: P of the next BIO is done on the CPU while D of the previous BIO is
ongoing on the device, assuming there is no plugging.

Somewhat the same should happen for the 32MB case too, since that would still
lead to a P(32) generating 32 x D(1). And again assuming that there is no plug,
the first 1MB BIO generated by the split of the 32MB BIO should start executing
right away while the split continues on the CPU. I guess the fragments are not
issued (a plug or an IO scheduler holding them?), so the overall latency becomes
P(32) + 32 x D(1), leading to the 2ms P(32) difference in latency that Changheun
observed. A blktrace could confirm that, I guess.

> 
> P(x) means time for preparing 'x' sized IO
> D(x) means time for device to handle 'x' sized IO >
> I depend on both CPU and the UFS drive.
> 
>> I don't think that the reason is page faulting in itself.
> 
> What I meant is that page faulting might contribute most part of the
> 100us(preparing 1MB data) and 2ms(preparing 32MB data). It can be others,
> but should be easy to figure out.

I may be wrong here (need to check again), but I thought the pages were faulted
in at get_user_pages() time, that is, for the entire 32MB user buffer, regardless
of the BIO size limit. If that's the case, then page faulting should not be a
significant factor in the latency difference.
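
As a rough numeric sketch of that overlap argument, assuming the figures quoted
in this thread (P(32M) ~ 2ms, P(1M) ~ 100us, ~16.7ms of device time for the
whole 32MB, split evenly across 32 fragments); these numbers are illustrative
only, not new measurements:

/*
 * Latency model for one 32MB sync O_DIRECT read. The timings are
 * assumptions taken from the numbers quoted in this thread.
 */
#include <stdio.h>

int main(void)
{
	double p32 = 2.0, p1 = 0.1;        /* preparation time, ms */
	double d32 = 16.7, d1 = d32 / 32;  /* device time, ms */

	/* one 32MB bio: all preparation happens before the first issue */
	double big_bio = p32 + 32 * d1;

	/*
	 * 32 x 1MB bios: once the first 1MB is prepared, preparing bio
	 * N+1 (p1 < d1) hides under the device time of bio N.
	 */
	double small_bios = p1 + 32 * d1;

	printf("single 32MB bio: %.1f ms\n", big_bio);    /* ~18.7 */
	printf("32 x 1MB bios  : %.1f ms\n", small_bios); /* ~16.8 */
	return 0;
}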
Ming Lei Jan. 14, 2021, 3:52 a.m. UTC | #14
On Wed, Jan 13, 2021 at 12:02:44PM +0000, Damien Le Moal wrote:
> On 2021/01/13 20:48, Ming Lei wrote:
> > On Wed, Jan 13, 2021 at 11:16:11AM +0000, Damien Le Moal wrote:
> >> On 2021/01/13 19:25, Ming Lei wrote:
> >>> On Wed, Jan 13, 2021 at 09:28:02AM +0000, Damien Le Moal wrote:
> >>>> On 2021/01/13 18:19, Ming Lei wrote:
> >>>>> On Wed, Jan 13, 2021 at 12:09 PM Changheun Lee <nanich.lee@samsung.com> wrote:
> >>>>>>
> >>>>>>> On 2021/01/12 21:14, Changheun Lee wrote:
> >>>>>>>>> On 2021/01/12 17:52, Changheun Lee wrote:
> >>>>>>>>>> From: "Changheun Lee" <nanich.lee@samsung.com>
> >>>>>>>>>>
> >>>>>>>>>> bio size can grow up to 4GB when muli-page bvec is enabled.
> >>>>>>>>>> but sometimes it would lead to inefficient behaviors.
> >>>>>>>>>> in case of large chunk direct I/O, - 64MB chunk read in user space -
> >>>>>>>>>> all pages for 64MB would be merged to a bio structure if memory address is
> >>>>>>>>>> continued phsycally. it makes some delay to submit until merge complete.
> >>>>>>>>>> bio max size should be limited as a proper size.
> >>>>>>>>>
> >>>>>>>>> But merging physically contiguous pages into the same bvec + later automatic bio
> >>>>>>>>> split on submit should give you better throughput for large IOs compared to
> >>>>>>>>> having to issue a bio chain of smaller BIOs that are arbitrarily sized and will
> >>>>>>>>> likely need splitting anyway (because of DMA boundaries etc).
> >>>>>>>>>
> >>>>>>>>> Do you have a specific case where you see higher performance with this patch
> >>>>>>>>> applied ? On Intel, BIO_MAX_SIZE would be 1MB... That is arbitrary and too small
> >>>>>>>>> considering that many hardware can execute larger IOs than that.
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>> When I tested 32MB chunk read with O_DIRECT in android, all pages of 32MB
> >>>>>>>> is merged into a bio structure.
> >>>>>>>> And elapsed time to merge complete was about 2ms.
> >>>>>>>> It means first bio-submit is after 2ms.
> >>>>>>>> If bio size is limited with 1MB with this patch, first bio-submit is about
> >>>>>>>> 100us by bio_full operation.
> >>>>>>>
> >>>>>>> bio_submit() will split the large BIO case into multiple requests while the
> >>>>>>> small BIO case will likely result one or two requests only. That likely explain
> >>>>>>> the time difference here. However, for the large case, the 2ms will issue ALL
> >>>>>>> requests needed for processing the entire 32MB user IO while the 1MB bio case
> >>>>>>> will need 32 different bio_submit() calls. So what is the actual total latency
> >>>>>>> difference for the entire 32MB user IO ? That is I think what needs to be
> >>>>>>> compared here.
> >>>>>>>
> >>>>>>> Also, what is your device max_sectors_kb and max queue depth ?
> >>>>>>>
> >>>>>>
> >>>>>> 32MB total latency is about 19ms including merge time without this patch.
> >>>>>> But with this patch, total latency is about 17ms including merge time too.
> >>>>>
> >>>>> 19ms looks too big just for preparing one 32MB sized bio, which isn't
> >>>>> supposed to
> >>>>> take so long.  Can you investigate where the 19ms is taken just for
> >>>>> preparing one
> >>>>> 32MB sized bio?
> >>>>
> >>>> Changheun mentioned that the device side IO latency is 16.7ms out of the 19ms
> >>>> total. So the BIO handling, submission+completion takes about 2.3ms, and
> >>>> Changheun points above to 2ms for the submission part.
> >>>
> >>> OK, looks I misunderstood the data.
> >>>
> >>>>
> >>>>>
> >>>>> It might be iov_iter_get_pages() for handling page fault. If yes, one suggestion
> >>>>> is to enable THP(Transparent HugePage Support) in your application.
> >>>>
> >>>> But if that was due to page faults, the same large-ish time would be taken for
> >>>> the preparing the size-limited BIOs too, no ? No matter how the BIOs are diced,
> >>>> all 32MB of pages of the user IO are referenced...
> >>>
> >>> If bio size is reduced to 1MB, just 256 pages need to be faulted before submitting this
> >>> bio, instead of 256*32 pages, that is why the following words are mentioned:
> >>>
> >>> 	It means first bio-submit is after 2ms.
> >>> 	If bio size is limited with 1MB with this patch, first bio-submit is about
> >>> 	100us by bio_full operation.
> >>
> >> Yes, but eventually, all pages for the 32MB IO will be faulted in, just not in
> >> one go. Overall number of page faults is likely the same as with the large BIO
> >> preparation. So I think we are back to my previous point, that is, reducing the
> >> device idle time by starting a BIO more quickly, even a small one, leads to
> >> overlap between CPU time needed for the next BIO preparation and previous BIO
> >> execution, reducing overall the latency for the entire 32MB user IO.
> > 
> > When bio size is reduced from 32M to 1M:
> > 
> > 1MB/(P(1M) + D(1M)) may become bigger than 32MB/(P(1M) + D(1M)), so
> > throughput is improved.
> 
> I think that the reason is that P(1M) < D(1M) and so there is overlap between P
> and D: P of the next BIO is done on the CPU while D of the previous BIO is
> ongoing on the device, assuming there is no plugging.

Looks like you are talking about AIO. IMO, if AIO is used in Changheun's
test, the UFS controller pipeline can easily be saturated by enough (8 or more)
in-flight 32MB requests (preparing each takes 2ms, and the device needs 16ms to
handle a 32MB request), so there shouldn't be such an issue.

So I guess Changheun uses sync dio, and the 2ms preparation time is added
to the bio submission delay every time.

Changheun, can you describe your 32MB block size direct IO test in a bit
more detail? AIO or sync dio? Do you have an fio command line to reproduce
this issue?


Thanks, 
Ming
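
For reference, a sketch of fio command lines that should approximate the two
patterns being discussed (the device path, size and job names are placeholders,
not taken from the original report):

# sync O_DIRECT, 32MB sequential reads issued one at a time
fio --name=seq32m-sync --filename=/dev/sdX --rw=read --bs=32M \
    --direct=1 --ioengine=psync --numjobs=1 --size=1G

# same pattern, but async with queue depth 8
fio --name=seq32m-aio --filename=/dev/sdX --rw=read --bs=32M \
    --direct=1 --ioengine=libaio --iodepth=8 --numjobs=1 --size=1G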
Damien Le Moal Jan. 14, 2021, 4 a.m. UTC | #15
On 2021/01/14 12:53, Ming Lei wrote:
> On Wed, Jan 13, 2021 at 12:02:44PM +0000, Damien Le Moal wrote:
>> On 2021/01/13 20:48, Ming Lei wrote:
>>> On Wed, Jan 13, 2021 at 11:16:11AM +0000, Damien Le Moal wrote:
>>>> On 2021/01/13 19:25, Ming Lei wrote:
>>>>> On Wed, Jan 13, 2021 at 09:28:02AM +0000, Damien Le Moal wrote:
>>>>>> On 2021/01/13 18:19, Ming Lei wrote:
>>>>>>> On Wed, Jan 13, 2021 at 12:09 PM Changheun Lee <nanich.lee@samsung.com> wrote:
>>>>>>>>
>>>>>>>>> On 2021/01/12 21:14, Changheun Lee wrote:
>>>>>>>>>>> On 2021/01/12 17:52, Changheun Lee wrote:
>>>>>>>>>>>> From: "Changheun Lee" <nanich.lee@samsung.com>
>>>>>>>>>>>>
>>>>>>>>>>>> bio size can grow up to 4GB when muli-page bvec is enabled.
>>>>>>>>>>>> but sometimes it would lead to inefficient behaviors.
>>>>>>>>>>>> in case of large chunk direct I/O, - 64MB chunk read in user space -
>>>>>>>>>>>> all pages for 64MB would be merged to a bio structure if memory address is
>>>>>>>>>>>> continued phsycally. it makes some delay to submit until merge complete.
>>>>>>>>>>>> bio max size should be limited as a proper size.
>>>>>>>>>>>
>>>>>>>>>>> But merging physically contiguous pages into the same bvec + later automatic bio
>>>>>>>>>>> split on submit should give you better throughput for large IOs compared to
>>>>>>>>>>> having to issue a bio chain of smaller BIOs that are arbitrarily sized and will
>>>>>>>>>>> likely need splitting anyway (because of DMA boundaries etc).
>>>>>>>>>>>
>>>>>>>>>>> Do you have a specific case where you see higher performance with this patch
>>>>>>>>>>> applied ? On Intel, BIO_MAX_SIZE would be 1MB... That is arbitrary and too small
>>>>>>>>>>> considering that many hardware can execute larger IOs than that.
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> When I tested 32MB chunk read with O_DIRECT in android, all pages of 32MB
>>>>>>>>>> is merged into a bio structure.
>>>>>>>>>> And elapsed time to merge complete was about 2ms.
>>>>>>>>>> It means first bio-submit is after 2ms.
>>>>>>>>>> If bio size is limited with 1MB with this patch, first bio-submit is about
>>>>>>>>>> 100us by bio_full operation.
>>>>>>>>>
>>>>>>>>> bio_submit() will split the large BIO case into multiple requests while the
>>>>>>>>> small BIO case will likely result one or two requests only. That likely explain
>>>>>>>>> the time difference here. However, for the large case, the 2ms will issue ALL
>>>>>>>>> requests needed for processing the entire 32MB user IO while the 1MB bio case
>>>>>>>>> will need 32 different bio_submit() calls. So what is the actual total latency
>>>>>>>>> difference for the entire 32MB user IO ? That is I think what needs to be
>>>>>>>>> compared here.
>>>>>>>>>
>>>>>>>>> Also, what is your device max_sectors_kb and max queue depth ?
>>>>>>>>>
>>>>>>>>
>>>>>>>> 32MB total latency is about 19ms including merge time without this patch.
>>>>>>>> But with this patch, total latency is about 17ms including merge time too.
>>>>>>>
>>>>>>> 19ms looks too big just for preparing one 32MB sized bio, which isn't
>>>>>>> supposed to
>>>>>>> take so long.  Can you investigate where the 19ms is taken just for
>>>>>>> preparing one
>>>>>>> 32MB sized bio?
>>>>>>
>>>>>> Changheun mentioned that the device side IO latency is 16.7ms out of the 19ms
>>>>>> total. So the BIO handling, submission+completion takes about 2.3ms, and
>>>>>> Changheun points above to 2ms for the submission part.
>>>>>
>>>>> OK, looks I misunderstood the data.
>>>>>
>>>>>>
>>>>>>>
>>>>>>> It might be iov_iter_get_pages() for handling page fault. If yes, one suggestion
>>>>>>> is to enable THP(Transparent HugePage Support) in your application.
>>>>>>
>>>>>> But if that was due to page faults, the same large-ish time would be taken for
>>>>>> the preparing the size-limited BIOs too, no ? No matter how the BIOs are diced,
>>>>>> all 32MB of pages of the user IO are referenced...
>>>>>
>>>>> If bio size is reduced to 1MB, just 256 pages need to be faulted before submitting this
>>>>> bio, instead of 256*32 pages, that is why the following words are mentioned:
>>>>>
>>>>> 	It means first bio-submit is after 2ms.
>>>>> 	If bio size is limited with 1MB with this patch, first bio-submit is about
>>>>> 	100us by bio_full operation.
>>>>
>>>> Yes, but eventually, all pages for the 32MB IO will be faulted in, just not in
>>>> one go. Overall number of page faults is likely the same as with the large BIO
>>>> preparation. So I think we are back to my previous point, that is, reducing the
>>>> device idle time by starting a BIO more quickly, even a small one, leads to
>>>> overlap between CPU time needed for the next BIO preparation and previous BIO
>>>> execution, reducing overall the latency for the entire 32MB user IO.
>>>
>>> When bio size is reduced from 32M to 1M:
>>>
>>> 1MB/(P(1M) + D(1M)) may become bigger than 32MB/(P(1M) + D(1M)), so
>>> throughput is improved.
>>
>> I think that the reason is that P(1M) < D(1M) and so there is overlap between P
>> and D: P of the next BIO is done on the CPU while D of the previous BIO is
>> ongoing on the device, assuming there is no plugging.
> 
> Looks you are talking about AIO. IMO, if AIO is used in Changheun's
> test, the UFS controller pipeline can be saturated easily by many
> enough(> 8 or more) 32M requests(preparing each takes 2ms, and device need
> 16ms to handle 32MB req), then there shouldn't be such issue.
> 
> So I guess Changheun uses sync dio, and the 2ms preparing time is added
> to bio submission delay every time.
> 
> Changheun, can you talk about your 32MB block size direct IO test in a
> bit detail? AIO or sync dio? Do you have fio command line to reproduce
> this issue?

Maybe also provide a blktrace output for one 32MB IO execution?

> 
> 
> Thanks, 
> Ming
> 
>
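
A sketch of how such a trace could be captured while the 32MB read runs (the
device name and duration are placeholders):

# capture block-layer events on the test device for 30 seconds
blktrace -d /dev/sdX -w 30 -o trace32m

# decode; Q/X/D/C events show queueing, splits, issues and completions
blkparse -i trace32m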
Changheun Lee Jan. 14, 2021, 4:35 a.m. UTC | #16
>On 2021/01/14 12:53, Ming Lei wrote:
>> On Wed, Jan 13, 2021 at 12:02:44PM +0000, Damien Le Moal wrote:
>>> On 2021/01/13 20:48, Ming Lei wrote:
>>>> On Wed, Jan 13, 2021 at 11:16:11AM +0000, Damien Le Moal wrote:
>>>>> On 2021/01/13 19:25, Ming Lei wrote:
>>>>>> On Wed, Jan 13, 2021 at 09:28:02AM +0000, Damien Le Moal wrote:
>>>>>>> On 2021/01/13 18:19, Ming Lei wrote:
>>>>>>>> On Wed, Jan 13, 2021 at 12:09 PM Changheun Lee <nanich.lee@samsung.com> wrote:
>>>>>>>>>
>>>>>>>>>> On 2021/01/12 21:14, Changheun Lee wrote:
>>>>>>>>>>>> On 2021/01/12 17:52, Changheun Lee wrote:
>>>>>>>>>>>>> From: "Changheun Lee" <nanich.lee@samsung.com>
>>>>>>>>>>>>>
>>>>>>>>>>>>> bio size can grow up to 4GB when muli-page bvec is enabled.
>>>>>>>>>>>>> but sometimes it would lead to inefficient behaviors.
>>>>>>>>>>>>> in case of large chunk direct I/O, - 64MB chunk read in user space -
>>>>>>>>>>>>> all pages for 64MB would be merged to a bio structure if memory address is
>>>>>>>>>>>>> continued phsycally. it makes some delay to submit until merge complete.
>>>>>>>>>>>>> bio max size should be limited as a proper size.
>>>>>>>>>>>>
>>>>>>>>>>>> But merging physically contiguous pages into the same bvec + later automatic bio
>>>>>>>>>>>> split on submit should give you better throughput for large IOs compared to
>>>>>>>>>>>> having to issue a bio chain of smaller BIOs that are arbitrarily sized and will
>>>>>>>>>>>> likely need splitting anyway (because of DMA boundaries etc).
>>>>>>>>>>>>
>>>>>>>>>>>> Do you have a specific case where you see higher performance with this patch
>>>>>>>>>>>> applied ? On Intel, BIO_MAX_SIZE would be 1MB... That is arbitrary and too small
>>>>>>>>>>>> considering that many hardware can execute larger IOs than that.
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> When I tested 32MB chunk read with O_DIRECT in android, all pages of 32MB
>>>>>>>>>>> is merged into a bio structure.
>>>>>>>>>>> And elapsed time to merge complete was about 2ms.
>>>>>>>>>>> It means first bio-submit is after 2ms.
>>>>>>>>>>> If bio size is limited with 1MB with this patch, first bio-submit is about
>>>>>>>>>>> 100us by bio_full operation.
>>>>>>>>>>
>>>>>>>>>> bio_submit() will split the large BIO case into multiple requests while the
>>>>>>>>>> small BIO case will likely result one or two requests only. That likely explain
>>>>>>>>>> the time difference here. However, for the large case, the 2ms will issue ALL
>>>>>>>>>> requests needed for processing the entire 32MB user IO while the 1MB bio case
>>>>>>>>>> will need 32 different bio_submit() calls. So what is the actual total latency
>>>>>>>>>> difference for the entire 32MB user IO ? That is I think what needs to be
>>>>>>>>>> compared here.
>>>>>>>>>>
>>>>>>>>>> Also, what is your device max_sectors_kb and max queue depth ?
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> 32MB total latency is about 19ms including merge time without this patch.
>>>>>>>>> But with this patch, total latency is about 17ms including merge time too.
>>>>>>>>
>>>>>>>> 19ms looks too big just for preparing one 32MB sized bio, which isn't
>>>>>>>> supposed to
>>>>>>>> take so long.  Can you investigate where the 19ms is taken just for
>>>>>>>> preparing one
>>>>>>>> 32MB sized bio?
>>>>>>>
>>>>>>> Changheun mentioned that the device side IO latency is 16.7ms out of the 19ms
>>>>>>> total. So the BIO handling, submission+completion takes about 2.3ms, and
>>>>>>> Changheun points above to 2ms for the submission part.
>>>>>>
>>>>>> OK, looks I misunderstood the data.
>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>> It might be iov_iter_get_pages() for handling page fault. If yes, one suggestion
>>>>>>>> is to enable THP(Transparent HugePage Support) in your application.
>>>>>>>
>>>>>>> But if that was due to page faults, the same large-ish time would be taken for
>>>>>>> the preparing the size-limited BIOs too, no ? No matter how the BIOs are diced,
>>>>>>> all 32MB of pages of the user IO are referenced...
>>>>>>
>>>>>> If bio size is reduced to 1MB, just 256 pages need to be faulted before submitting this
>>>>>> bio, instead of 256*32 pages, that is why the following words are mentioned:
>>>>>>
>>>>>> 	It means first bio-submit is after 2ms.
>>>>>> 	If bio size is limited with 1MB with this patch, first bio-submit is about
>>>>>> 	100us by bio_full operation.
>>>>>
>>>>> Yes, but eventually, all pages for the 32MB IO will be faulted in, just not in
>>>>> one go. Overall number of page faults is likely the same as with the large BIO
>>>>> preparation. So I think we are back to my previous point, that is, reducing the
>>>>> device idle time by starting a BIO more quickly, even a small one, leads to
>>>>> overlap between CPU time needed for the next BIO preparation and previous BIO
>>>>> execution, reducing overall the latency for the entire 32MB user IO.
>>>>
>>>> When bio size is reduced from 32M to 1M:
>>>>
>>>> 1MB/(P(1M) + D(1M)) may become bigger than 32MB/(P(1M) + D(1M)), so
>>>> throughput is improved.
>>>
>>> I think that the reason is that P(1M) < D(1M) and so there is overlap between P
>>> and D: P of the next BIO is done on the CPU while D of the previous BIO is
>>> ongoing on the device, assuming there is no plugging.
>> 
>> Looks you are talking about AIO. IMO, if AIO is used in Changheun's
>> test, the UFS controller pipeline can be saturated easily by many
>> enough(> 8 or more) 32M requests(preparing each takes 2ms, and device need
>> 16ms to handle 32MB req), then there shouldn't be such issue.
>> 
>> So I guess Changheun uses sync dio, and the 2ms preparing time is added
>> to bio submission delay every time.
>> 
>> Changheun, can you talk about your 32MB block size direct IO test in a
>> bit detail? AIO or sync dio? Do you have fio command line to reproduce
>> this issue?
>
>Maybe also provide a blktrace output of for one 32MB IO execution ?

When a 32MB chunk read with the direct I/O option comes from userspace, the
current kernel behavior is as below. It's a timeline.

 | bio merge for 32MB. total 8,192 pages are merged.
 | total elapsed time is over 2ms.
 |------------------ ... ----------------------->|
                                                 | 8,192 pages merged a bio.
                                                 | at this time, first bio submit is done.
                                                 | 1 bio is split to 32 read request and issue.
                                                 |--------------->
                                                  |--------------->
                                                   |--------------->
                                                              ......
                                                                   |--------------->
                                                                    |--------------->|
                                                                                     | 
                          total 19ms elapsed to complete 32MB read done from device. |

If bio max size is limited to 1MB, the behavior changes as below.

 | bio merge for 1MB. 256 pages are merged for each bio.
 | total 32 bio will be made.
 | total elapsed time is over 2ms. it's same.
 | but, first bio submit timing is fast. about 100us.
 |--->|--->|--->|---> ... -->|--->|--->|--->|--->|
      | 256 pages merged a bio.
      | at this time, first bio submit is done.
      | and 1 read request is issued for 1 bio.
      |--------------->
           |--------------->
                |--------------->
                                      ......
                                                 |--------------->
                                                  |--------------->|
                                                                   | 
        total 17ms elapsed to complete 32MB read done from device. | 


As a result, read request issue timing is faster if bio max size is limited.
With current kernel behavior and multi-page bvec, a super large bio can be
created, and it delays the first I/O request issue.
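
As a standalone illustration of the timelines above (a toy model, not the
actual kernel code path), counting how many pages are merged before the first
bio submit for a given size cap:

/*
 * Toy model of the direct IO page-merge behavior described above:
 * pages are accumulated into a bio until it is full, so a 1MB cap
 * makes the first submit happen after 256 pages instead of 8,192.
 */
#include <stdio.h>

#define PAGE_SZ		4096UL
#define CHUNK_SIZE	(32UL << 20)	/* 32MB user read */

static unsigned long first_submit_pages(unsigned long bio_max_size)
{
	unsigned long bio_size = 0, pages = 0;

	for (unsigned long done = 0; done < CHUNK_SIZE; done += PAGE_SZ) {
		if (bio_size + PAGE_SZ > bio_max_size)
			return pages;	/* bio is full: first submit here */
		bio_size += PAGE_SZ;
		pages++;
	}
	return pages;			/* whole chunk merged into one bio */
}

int main(void)
{
	printf("no cap : first submit after %lu pages\n",
	       first_submit_pages(~0UL));		/* 8192 */
	printf("1MB cap: first submit after %lu pages\n",
	       first_submit_pages(1UL << 20));		/* 256 */
	return 0;
}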

>
>> 
>> 
>> Thanks, 
>> Ming
>> 
>> 
>
>
>-- 
>Damien Le Moal
>Western Digital Research
>
diff mbox series

Patch

diff --git a/block/bio.c b/block/bio.c
index 1f2cc1fbe283..dbe14d675f28 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -877,7 +877,7 @@  bool __bio_try_merge_page(struct bio *bio, struct page *page,
 		struct bio_vec *bv = &bio->bi_io_vec[bio->bi_vcnt - 1];
 
 		if (page_is_mergeable(bv, page, len, off, same_page)) {
-			if (bio->bi_iter.bi_size > UINT_MAX - len) {
+			if (bio->bi_iter.bi_size > BIO_MAX_SIZE - len) {
 				*same_page = false;
 				return false;
 			}
diff --git a/include/linux/bio.h b/include/linux/bio.h
index 1edda614f7ce..0f49b354b1f6 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -20,6 +20,7 @@ 
 #endif
 
 #define BIO_MAX_PAGES		256
+#define BIO_MAX_SIZE		(BIO_MAX_PAGES * PAGE_SIZE)
 
 #define bio_prio(bio)			(bio)->bi_ioprio
 #define bio_set_prio(bio, prio)		((bio)->bi_ioprio = prio)
@@ -113,7 +114,7 @@  static inline bool bio_full(struct bio *bio, unsigned len)
 	if (bio->bi_vcnt >= bio->bi_max_vecs)
 		return true;
 
-	if (bio->bi_iter.bi_size > UINT_MAX - len)
+	if (bio->bi_iter.bi_size > BIO_MAX_SIZE - len)
 		return true;
 
 	return false;
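
For what it's worth, the cap introduced by this definition depends on the
architecture's page size; a small sketch of the resulting values (the listed
page sizes are only examples):

/*
 * Illustration only: BIO_MAX_SIZE as defined in this patch, for a few
 * page sizes (BIO_MAX_PAGES is fixed at 256).
 */
#include <stdio.h>

#define BIO_MAX_PAGES	256UL

int main(void)
{
	unsigned long page_sizes[] = { 4096, 16384, 65536 };	/* bytes */

	for (int i = 0; i < 3; i++)
		printf("PAGE_SIZE %2luKB -> BIO_MAX_SIZE %5luKB\n",
		       page_sizes[i] / 1024,
		       BIO_MAX_PAGES * page_sizes[i] / 1024);
	return 0;
}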