
[RFC,2/2] block: add a fast path for seg split of large bio

Message ID 53b86d4e86c4913658cb0f472dcc3e22ef75396b.1609875589.git.asml.silence@gmail.com (mailing list archive)
State New, archived
Series optimise split bio

Commit Message

Pavel Begunkov Jan. 5, 2021, 7:43 p.m. UTC
blk_bio_segment_split() is very heavy, but the current fast path covers
only single-segment bios under PAGE_SIZE. Add another fast path by
estimating an upper bound of the sectors a bio can contain.

One restricting factor here is queue_max_segment_size(), which is
compared against the full iter size to avoid digging into bvecs. By
default it's 64KB, so the fast path only covers requests under 64KB,
but for bios that meet the conditions it's much faster.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
---
 block/blk-merge.c | 29 +++++++++++++++++++++++++----
 1 file changed, 25 insertions(+), 4 deletions(-)
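
For reference, a minimal standalone sketch of the estimate the commit message
describes (plain userspace C, not kernel code; estimated_max_sectors() is a
hypothetical name): each segment can contribute at most two partial sectors,
one at each end, and everything else is full sectors.

#include <stdio.h>

/* Hypothetical standalone model of the upper-bound estimate, not kernel code. */
static unsigned int estimated_max_sectors(unsigned int bio_segs,
					  unsigned int bi_size)
{
	/*
	 * Each segment may add at most two partial sectors (one at each end);
	 * rounding bi_size down is fine because those partial sectors are
	 * already covered by the first term.
	 */
	return bio_segs * 2 + (bi_size >> 9);
}

int main(void)
{
	/* e.g. a 3-segment, 12KiB bio: 3 * 2 + 24 = 30 sectors at most */
	printf("%u\n", estimated_max_sectors(3, 12 * 1024));
	return 0;
}

So a 3-segment, 12KiB bio is bounded by 3 * 2 + 24 = 30 sectors.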

Comments

Pavel Begunkov Jan. 5, 2021, 8:36 p.m. UTC | #1
On 05/01/2021 19:43, Pavel Begunkov wrote:
> blk_bio_segment_split() is very heavy, but the current fast path covers
> only single-segment bios under PAGE_SIZE. Add another fast path by
> estimating an upper bound of the sectors a bio can contain.
> 
> One restricting factor here is queue_max_segment_size(), which is
> compared against the full iter size to avoid digging into bvecs. By
> default it's 64KB, so the fast path only covers requests under 64KB,
> but for bios that meet the conditions it's much faster.
> 
> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
> ---
>  block/blk-merge.c | 29 +++++++++++++++++++++++++----
>  1 file changed, 25 insertions(+), 4 deletions(-)
> 
[...]
>  
> -	return __blk_bio_segment_split(q, bio, bs, nr_segs);
> +	q_max_sectors = get_max_io_size(q, bio);
> +	if (!queue_virt_boundary(q) && bio_segs < queue_max_segments(q) &&
> +	    bio->bi_iter.bi_size <= queue_max_segment_size(q)) {

I think I'm missing a seg_boundary_mask check here. Any insights on how to skip it?
Or why is it 2^31-1 by default, and not, say, ((unsigned long)-1)?
Christoph Hellwig Jan. 27, 2021, 5:16 p.m. UTC | #2
On Tue, Jan 05, 2021 at 07:43:38PM +0000, Pavel Begunkov wrote:
> blk_bio_segment_split() is very heavy, but the current fast path covers
> only single-segment bios under PAGE_SIZE. Add another fast path by
> estimating an upper bound of the sectors a bio can contain.
> 
> One restricting factor here is queue_max_segment_size(), which is
> compared against the full iter size to avoid digging into bvecs. By
> default it's 64KB, so the fast path only covers requests under 64KB,
> but for bios that meet the conditions it's much faster.

I think this works, but it is a pretty gross heuristic, which also
doesn't help us with NVMe, which is the I/O fast path of choice for
most people.  What is your use/test case?

> +		/*
> +		 * Segments are internally contiguous, so only their ends may
> +		 * be partial sectors. An upper bound is to assume each end
> +		 * takes just 1 byte but still adds a sector, and all the rest
> +		 * are full sectors. Note: it's ok to round the size down, as
> +		 * partial sectors are already accounted for by the first term.
> +		 */
> +		max_sectors = bio_segs * 2;
> +		max_sectors += bio->bi_iter.bi_size >> 9;
> +
> +		if (max_sectors < q_max_sectors) {

I don't think we need the max_sectors variable here.
Pavel Begunkov Jan. 28, 2021, 11:56 a.m. UTC | #3
On 27/01/2021 17:16, Christoph Hellwig wrote:
> On Tue, Jan 05, 2021 at 07:43:38PM +0000, Pavel Begunkov wrote:
>> blk_bio_segment_split() is very heavy, but the current fast path covers
>> only single-segment bios under PAGE_SIZE. Add another fast path by
>> estimating an upper bound of the sectors a bio can contain.
>>
>> One restricting factor here is queue_max_segment_size(), which is
>> compared against the full iter size to avoid digging into bvecs. By
>> default it's 64KB, so the fast path only covers requests under 64KB,
>> but for bios that meet the conditions it's much faster.
> 
> I think this works, but it is a pretty gross heuristic, which also

bio->bi_iter.bi_size <= queue_max_segment_size(q)

You mean this, right? I wouldn't say it's gross, but it is _very_ loose.

> doesn't help us with NVMe, which is the I/O fast path of choice for
> most people.  What is your use/test case?

Yeah, the idea is to make it work for NVMe. I don't remember it restricting
segment size or anything like that, maybe only atomicity. Which condition do
you see as problematic? I can't recall without opening the spec.

> 
>> +		/*
>> +		 * Segments are internally contiguous, so only their ends may
>> +		 * be partial sectors. An upper bound is to assume each end
>> +		 * takes just 1 byte but still adds a sector, and all the rest
>> +		 * are full sectors. Note: it's ok to round the size down, as
>> +		 * partial sectors are already accounted for by the first term.
>> +		 */
>> +		max_sectors = bio_segs * 2;
>> +		max_sectors += bio->bi_iter.bi_size >> 9;
>> +
>> +		if (max_sectors < q_max_sectors) {
> 
> I don't think we need the max_sectors variable here.

Even better, considering that it's already sector-aligned, we can kill off
this whole section and just use (bi_iter.bi_size >> 9).
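
A minimal sketch of that simplification, assuming the bio size is
sector-aligned (a multiple of 512 bytes); sectors_for_aligned_size() is a
hypothetical name, not part of the patch:

#include <assert.h>
#include <stdio.h>

/* Hypothetical helper, not from the patch: for sector-aligned sizes the
 * per-segment slack is unnecessary and the bound is exact. */
static unsigned int sectors_for_aligned_size(unsigned int bi_size)
{
	assert((bi_size & 511) == 0);	/* sector aligned */
	return bi_size >> 9;		/* bi_size / 512 */
}

int main(void)
{
	printf("%u\n", sectors_for_aligned_size(64 * 1024));	/* prints 128 */
	return 0;
}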
Ming Lei Jan. 28, 2021, 12:10 p.m. UTC | #4
On Tue, Jan 05, 2021 at 07:43:38PM +0000, Pavel Begunkov wrote:
> blk_bio_segment_split() is very heavy, but the current fast path covers
> only single-segment bios under PAGE_SIZE. Add another fast path by
> estimating an upper bound of the sectors a bio can contain.
> 
> One restricting factor here is queue_max_segment_size(), which is
> compared against the full iter size to avoid digging into bvecs. By
> default it's 64KB, so the fast path only covers requests under 64KB,
> but for bios that meet the conditions it's much faster.
> 
> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
> ---
>  block/blk-merge.c | 29 +++++++++++++++++++++++++----
>  1 file changed, 25 insertions(+), 4 deletions(-)
> 
> diff --git a/block/blk-merge.c b/block/blk-merge.c
> index 84b9635b5d57..15d75f3ffc30 100644
> --- a/block/blk-merge.c
> +++ b/block/blk-merge.c
> @@ -226,12 +226,12 @@ static bool bvec_split_segs(const struct request_queue *q,
>  static struct bio *__blk_bio_segment_split(struct request_queue *q,
>  					   struct bio *bio,
>  					   struct bio_set *bs,
> -					   unsigned *segs)
> +					   unsigned *segs,
> +					   const unsigned max_sectors)
>  {
>  	struct bio_vec bv, bvprv, *bvprvp = NULL;
>  	struct bvec_iter iter;
>  	unsigned nsegs = 0, sectors = 0;
> -	const unsigned max_sectors = get_max_io_size(q, bio);
>  	const unsigned max_segs = queue_max_segments(q);
>  
>  	bio_for_each_bvec(bv, bio, iter) {
> @@ -295,6 +295,9 @@ static inline struct bio *blk_bio_segment_split(struct request_queue *q,
>  						struct bio_set *bs,
>  						unsigned *nr_segs)
>  {
> +	unsigned int max_sectors, q_max_sectors;
> +	unsigned int bio_segs = bio->bi_vcnt;
> +
>  	/*
>  	 * All drivers must accept single-segments bios that are <=
>  	 * PAGE_SIZE.  This is a quick and dirty check that relies on
> @@ -303,14 +306,32 @@ static inline struct bio *blk_bio_segment_split(struct request_queue *q,
>  	 * are cloned, but compared to the performance impact of cloned
>  	 * bios themselves the loop below doesn't matter anyway.
>  	 */
> -	if (!q->limits.chunk_sectors && bio->bi_vcnt == 1 &&
> +	if (!q->limits.chunk_sectors && bio_segs == 1 &&
>  	    (bio->bi_io_vec[0].bv_len +
>  	     bio->bi_io_vec[0].bv_offset) <= PAGE_SIZE) {
>  		*nr_segs = 1;
>  		return NULL;
>  	}
>  
> -	return __blk_bio_segment_split(q, bio, bs, nr_segs);
> +	q_max_sectors = get_max_io_size(q, bio);
> +	if (!queue_virt_boundary(q) && bio_segs < queue_max_segments(q) &&
> +	    bio->bi_iter.bi_size <= queue_max_segment_size(q)) {

.bi_vcnt is 0 for a fast-cloned bio, so the above check may become true
when the real segment count is > queue_max_segments(), especially in the
case where the max segments limit is small and the segment size is big.
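
To make the failure mode concrete, a hypothetical example with made-up
numbers: with .bi_vcnt == 0 the proposed bio_segs < queue_max_segments(q)
check is trivially true, even if the shared bvec table really holds more
segments than the queue allows.

#include <stdio.h>

int main(void)
{
	unsigned int bio_segs  = 0;	/* .bi_vcnt of a fast-cloned bio */
	unsigned int real_segs = 256;	/* segments actually in the shared bvec table */
	unsigned int max_segs  = 128;	/* queue_max_segments(q) */

	/* The proposed check passes (0 < 128) although the real segment
	 * count exceeds the queue limit. */
	if (bio_segs < max_segs && real_segs > max_segs)
		printf("fast path wrongly taken: %u real segs > %u max\n",
		       real_segs, max_segs);
	return 0;
}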
Pavel Begunkov Jan. 28, 2021, 12:27 p.m. UTC | #5
On 28/01/2021 12:10, Ming Lei wrote:
> On Tue, Jan 05, 2021 at 07:43:38PM +0000, Pavel Begunkov wrote:
>> blk_bio_segment_split() is very heavy, but the current fast path covers
>> only single-segment bios under PAGE_SIZE. Add another fast path by
>> estimating an upper bound of the sectors a bio can contain.
>>
>> One restricting factor here is queue_max_segment_size(), which is
>> compared against the full iter size to avoid digging into bvecs. By
>> default it's 64KB, so the fast path only covers requests under 64KB,
>> but for bios that meet the conditions it's much faster.
>>
>> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
>> ---
>>  block/blk-merge.c | 29 +++++++++++++++++++++++++----
>>  1 file changed, 25 insertions(+), 4 deletions(-)
>>
>> diff --git a/block/blk-merge.c b/block/blk-merge.c
>> index 84b9635b5d57..15d75f3ffc30 100644
>> --- a/block/blk-merge.c
>> +++ b/block/blk-merge.c
>> @@ -226,12 +226,12 @@ static bool bvec_split_segs(const struct request_queue *q,
>>  static struct bio *__blk_bio_segment_split(struct request_queue *q,
>>  					   struct bio *bio,
>>  					   struct bio_set *bs,
>> -					   unsigned *segs)
>> +					   unsigned *segs,
>> +					   const unsigned max_sectors)
>>  {
>>  	struct bio_vec bv, bvprv, *bvprvp = NULL;
>>  	struct bvec_iter iter;
>>  	unsigned nsegs = 0, sectors = 0;
>> -	const unsigned max_sectors = get_max_io_size(q, bio);
>>  	const unsigned max_segs = queue_max_segments(q);
>>  
>>  	bio_for_each_bvec(bv, bio, iter) {
>> @@ -295,6 +295,9 @@ static inline struct bio *blk_bio_segment_split(struct request_queue *q,
>>  						struct bio_set *bs,
>>  						unsigned *nr_segs)
>>  {
>> +	unsigned int max_sectors, q_max_sectors;
>> +	unsigned int bio_segs = bio->bi_vcnt;
>> +
>>  	/*
>>  	 * All drivers must accept single-segments bios that are <=
>>  	 * PAGE_SIZE.  This is a quick and dirty check that relies on
>> @@ -303,14 +306,32 @@ static inline struct bio *blk_bio_segment_split(struct request_queue *q,
>>  	 * are cloned, but compared to the performance impact of cloned
>>  	 * bios themselves the loop below doesn't matter anyway.
>>  	 */
>> -	if (!q->limits.chunk_sectors && bio->bi_vcnt == 1 &&
>> +	if (!q->limits.chunk_sectors && bio_segs == 1 &&
>>  	    (bio->bi_io_vec[0].bv_len +
>>  	     bio->bi_io_vec[0].bv_offset) <= PAGE_SIZE) {
>>  		*nr_segs = 1;
>>  		return NULL;
>>  	}
>>  
>> -	return __blk_bio_segment_split(q, bio, bs, nr_segs);
>> +	q_max_sectors = get_max_io_size(q, bio);
>> +	if (!queue_virt_boundary(q) && bio_segs < queue_max_segments(q) &&
>> +	    bio->bi_iter.bi_size <= queue_max_segment_size(q)) {
> 
> .bi_vcnt is 0 for a fast-cloned bio, so the above check may become true
> when the real segment count is > queue_max_segments(), especially in the
> case where the max segments limit is small and the segment size is big.

I guess we can skip the fast path for them (i.e. bi_vcnt == 0)

I'm curious, why is it 0 and not the real number?
Ming Lei Jan. 29, 2021, 2 a.m. UTC | #6
On Thu, Jan 28, 2021 at 12:27:39PM +0000, Pavel Begunkov wrote:
> On 28/01/2021 12:10, Ming Lei wrote:
> > On Tue, Jan 05, 2021 at 07:43:38PM +0000, Pavel Begunkov wrote:
> >> blk_bio_segment_split() is very heavy, but the current fast path covers
> >> only single-segment bios under PAGE_SIZE. Add another fast path by
> >> estimating an upper bound of the sectors a bio can contain.
> >>
> >> One restricting factor here is queue_max_segment_size(), which is
> >> compared against the full iter size to avoid digging into bvecs. By
> >> default it's 64KB, so the fast path only covers requests under 64KB,
> >> but for bios that meet the conditions it's much faster.
> >>
> >> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
> >> ---
> >>  block/blk-merge.c | 29 +++++++++++++++++++++++++----
> >>  1 file changed, 25 insertions(+), 4 deletions(-)
> >>
> >> diff --git a/block/blk-merge.c b/block/blk-merge.c
> >> index 84b9635b5d57..15d75f3ffc30 100644
> >> --- a/block/blk-merge.c
> >> +++ b/block/blk-merge.c
> >> @@ -226,12 +226,12 @@ static bool bvec_split_segs(const struct request_queue *q,
> >>  static struct bio *__blk_bio_segment_split(struct request_queue *q,
> >>  					   struct bio *bio,
> >>  					   struct bio_set *bs,
> >> -					   unsigned *segs)
> >> +					   unsigned *segs,
> >> +					   const unsigned max_sectors)
> >>  {
> >>  	struct bio_vec bv, bvprv, *bvprvp = NULL;
> >>  	struct bvec_iter iter;
> >>  	unsigned nsegs = 0, sectors = 0;
> >> -	const unsigned max_sectors = get_max_io_size(q, bio);
> >>  	const unsigned max_segs = queue_max_segments(q);
> >>  
> >>  	bio_for_each_bvec(bv, bio, iter) {
> >> @@ -295,6 +295,9 @@ static inline struct bio *blk_bio_segment_split(struct request_queue *q,
> >>  						struct bio_set *bs,
> >>  						unsigned *nr_segs)
> >>  {
> >> +	unsigned int max_sectors, q_max_sectors;
> >> +	unsigned int bio_segs = bio->bi_vcnt;
> >> +
> >>  	/*
> >>  	 * All drivers must accept single-segments bios that are <=
> >>  	 * PAGE_SIZE.  This is a quick and dirty check that relies on
> >> @@ -303,14 +306,32 @@ static inline struct bio *blk_bio_segment_split(struct request_queue *q,
> >>  	 * are cloned, but compared to the performance impact of cloned
> >>  	 * bios themselves the loop below doesn't matter anyway.
> >>  	 */
> >> -	if (!q->limits.chunk_sectors && bio->bi_vcnt == 1 &&
> >> +	if (!q->limits.chunk_sectors && bio_segs == 1 &&
> >>  	    (bio->bi_io_vec[0].bv_len +
> >>  	     bio->bi_io_vec[0].bv_offset) <= PAGE_SIZE) {
> >>  		*nr_segs = 1;
> >>  		return NULL;
> >>  	}
> >>  
> >> -	return __blk_bio_segment_split(q, bio, bs, nr_segs);
> >> +	q_max_sectors = get_max_io_size(q, bio);
> >> +	if (!queue_virt_boundary(q) && bio_segs < queue_max_segments(q) &&
> >> +	    bio->bi_iter.bi_size <= queue_max_segment_size(q)) {
> > 
> > .bi_vcnt is 0 for a fast-cloned bio, so the above check may become true
> > when the real segment count is > queue_max_segments(), especially in the
> > case where the max segments limit is small and the segment size is big.
> 
> I guess we can skip the fast path for them (i.e. bi_vcnt == 0)

But .bi_vcnt can't represent the real segment count, which can be bigger
or smaller than .bi_vcnt.

> I'm curious, why is it 0 and not the real number?

A fast-cloned bio shares the bvec table of the original bio, so it doesn't
have its own .bi_vcnt.
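
For illustration, a rough standalone sketch of that sharing (simplified
stand-in types, not the actual kernel structs or the real fast-clone helper):

/* Simplified stand-ins for the kernel types, for illustration only. */
struct bvec_iter_sketch { unsigned int bi_size, bi_idx; };
struct bio_vec_sketch   { void *bv_page; unsigned int bv_len, bv_offset; };

struct bio_sketch {
	struct bvec_iter_sketch	 bi_iter;	/* current position / size */
	unsigned short		 bi_vcnt;	/* only meaningful for the table's owner */
	struct bio_vec_sketch	*bi_io_vec;	/* shared or owned bvec table */
};

/*
 * Rough model of a fast clone: copy the iterator, share the parent's bvec
 * table, and leave bi_vcnt at 0 because the clone doesn't own the table.
 */
static void fast_clone_sketch(struct bio_sketch *clone,
			      const struct bio_sketch *src)
{
	clone->bi_iter = src->bi_iter;
	clone->bi_io_vec = src->bi_io_vec;
	clone->bi_vcnt = 0;
}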
Pavel Begunkov Feb. 1, 2021, 10:59 a.m. UTC | #7
On 29/01/2021 02:00, Ming Lei wrote:
> On Thu, Jan 28, 2021 at 12:27:39PM +0000, Pavel Begunkov wrote:
>> On 28/01/2021 12:10, Ming Lei wrote:
>>> On Tue, Jan 05, 2021 at 07:43:38PM +0000, Pavel Begunkov wrote:
>>> .bi_vcnt is 0 for a fast-cloned bio, so the above check may become true
>>> when the real segment count is > queue_max_segments(), especially in the
>>> case where the max segments limit is small and the segment size is big.
>>
>> I guess we can skip the fast path for them (i.e. bi_vcnt == 0)
> 
> But .bi_vcnt can't represent the real segment count, which can be bigger
> or smaller than .bi_vcnt.

Sounds like "go slow path on bi_vcnt == 0" won't work. OK, I need to go
look at what the real purpose of .bi_vcnt is.

> 
>> I'm curious, why is it 0 and not the real number?
> 
> A fast-cloned bio shares the bvec table of the original bio, so it
> doesn't have its own .bi_vcnt.

Got it, thanks

Patch

diff --git a/block/blk-merge.c b/block/blk-merge.c
index 84b9635b5d57..15d75f3ffc30 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -226,12 +226,12 @@  static bool bvec_split_segs(const struct request_queue *q,
 static struct bio *__blk_bio_segment_split(struct request_queue *q,
 					   struct bio *bio,
 					   struct bio_set *bs,
-					   unsigned *segs)
+					   unsigned *segs,
+					   const unsigned max_sectors)
 {
 	struct bio_vec bv, bvprv, *bvprvp = NULL;
 	struct bvec_iter iter;
 	unsigned nsegs = 0, sectors = 0;
-	const unsigned max_sectors = get_max_io_size(q, bio);
 	const unsigned max_segs = queue_max_segments(q);
 
 	bio_for_each_bvec(bv, bio, iter) {
@@ -295,6 +295,9 @@  static inline struct bio *blk_bio_segment_split(struct request_queue *q,
 						struct bio_set *bs,
 						unsigned *nr_segs)
 {
+	unsigned int max_sectors, q_max_sectors;
+	unsigned int bio_segs = bio->bi_vcnt;
+
 	/*
 	 * All drivers must accept single-segments bios that are <=
 	 * PAGE_SIZE.  This is a quick and dirty check that relies on
@@ -303,14 +306,32 @@  static inline struct bio *blk_bio_segment_split(struct request_queue *q,
 	 * are cloned, but compared to the performance impact of cloned
 	 * bios themselves the loop below doesn't matter anyway.
 	 */
-	if (!q->limits.chunk_sectors && bio->bi_vcnt == 1 &&
+	if (!q->limits.chunk_sectors && bio_segs == 1 &&
 	    (bio->bi_io_vec[0].bv_len +
 	     bio->bi_io_vec[0].bv_offset) <= PAGE_SIZE) {
 		*nr_segs = 1;
 		return NULL;
 	}
 
-	return __blk_bio_segment_split(q, bio, bs, nr_segs);
+	q_max_sectors = get_max_io_size(q, bio);
+	if (!queue_virt_boundary(q) && bio_segs < queue_max_segments(q) &&
+	    bio->bi_iter.bi_size <= queue_max_segment_size(q)) {
+		/*
+		 * Segments are internally contiguous, so only their ends may
+		 * be partial sectors. An upper bound is to assume each end
+		 * takes just 1 byte but still adds a sector, and all the rest
+		 * are full sectors. Note: it's ok to round the size down, as
+		 * partial sectors are already accounted for by the first term.
+		 */
+		max_sectors = bio_segs * 2;
+		max_sectors += bio->bi_iter.bi_size >> 9;
+
+		if (max_sectors < q_max_sectors) {
+			*nr_segs = bio_segs;
+			return NULL;
+		}
+	}
+	return __blk_bio_segment_split(q, bio, bs, nr_segs, q_max_sectors);
 }
 
 /**