
[for-4.4] block: split bios to max possible length

Message ID CACVXFVNQQNd8_3ti4AAhZB4KN0uJEN8sdmB+xsmpykJaCLfebA@mail.gmail.com (mailing list archive)
State New, archived

Commit Message

Ming Lei Jan. 6, 2016, 2:17 a.m. UTC
Hi Keith,

On Tue, Jan 5, 2016 at 11:09 PM, Keith Busch <keith.busch@intel.com> wrote:
> On Tue, Jan 05, 2016 at 12:54:53PM +0800, Ming Lei wrote:
>> On Tue, Jan 5, 2016 at 2:24 AM, Keith Busch <keith.busch@intel.com> wrote:
>> > This allows bio splits in the middle of a vector to form the largest
>>
>> Wrt. the current block stack, one segment always holds one or more bvecs,
>> never part of a bvec, so we had better be careful about this handling.
>
> Hi Ming,
>
> Could you help me understand your concern here? If we split a vector
> somewhere in the middle, it becomes two different bvecs. The first is
> the last segment in the first bio, the second is the first segment in
> the split bio, right?

Firstly, we never split a single bio vector before bio splitting was
introduced.

Secondly, the current bio split still doesn't support splitting a single
bvec into two; it just makes the two bios share the original bvec table.
Please see bio_split(), which calls bio_clone_fast() to do that, and the
bvec table is immutable by that point.
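For reference, bio_split() at this point is roughly the following (a trimmed
paraphrase of the 4.4-era block/bio.c, with the discard and integrity
handling omitted, not verbatim source); note that it only clones the bio and
trims the iterators, it never rewrites the shared bvec table:

struct bio *bio_split(struct bio *bio, int sectors,
		      gfp_t gfp, struct bio_set *bs)
{
	struct bio *split;

	BUG_ON(sectors <= 0);
	BUG_ON(sectors >= bio_sectors(bio));

	/* the clone shares the parent's (now immutable) bvec table */
	split = bio_clone_fast(bio, gfp, bs);
	if (!split)
		return NULL;

	/* trim the front part down to 'sectors' ... */
	split->bi_iter.bi_size = sectors << 9;

	/* ... and advance the parent past the split point */
	bio_advance(bio, split->bi_iter.bi_size);

	return split;
}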

>
> It's not necessarily a new segment if it is physically contiguous with
> the previous (if it exists at all), but duplicating the logic to coalesce
> addresses doesn't seem to be worth that optimization.

I understand your motivation in the two patches. Actually, before bio
splitting we didn't do sg merge for nvme because of the NO_SG_MERGE flag,
which has been ignored since bio splitting was introduced. So could you
check whether nvme performance is still good with NO_SG_MERGE honored
again in blk_bio_segment_split()? The change should be simple, like the
attached patch.

>
>> > possible bio at the h/w's desired alignment, and guarantees the bio being
>> > split will have some data. Previously, if the first vector's length was
>> > greater than the allowable amount, the bio would split at a zero length
>> > and hit a kernel BUG.
>>
>> That was introduced by d3805611130a, and a zero-length split couldn't happen
>> previously because queue_max_sectors() is at least one PAGE_SIZE.
>
> Can a bvec's length exceed a PAGE_SIZE? They point to pages, so I
> suppose not.

No, it can't, but blk_max_size_offset() may be less than PAGE_SIZE, and
then a zero-length split is triggered.
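To make that concrete, the pre-patch loop in blk_bio_segment_split() has
roughly this shape (paraphrased, not verbatim):

	bio_for_each_segment(bv, bio, iter) {
		/*
		 * If the very first bvec already exceeds the per-offset
		 * limit, we jump to split with sectors still 0, build a
		 * zero-length bio, and trip BUG_ON(sectors <= 0) in
		 * bio_split().
		 */
		if (sectors + (bv.bv_len >> 9) >
		    blk_max_size_offset(q, bio->bi_iter.bi_sector))
			goto split;

		sectors += bv.bv_len >> 9;
	}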

>
> But it should be more efficient to split to the largest allowed by the
> hardware. We can contrive a scenario where a bio would be split many

Previously Jens took the opposite approach, making each bvec
one segment, and he mentioned that performance increased.

> times more than necessary without this patch. Let's say queue_max_sectors
> is a PAGE_SIZE, and we want to submit '2 * PAGE_SIZE' worth of data
> addressed in 3 bvecs. Previously that would split three times; now it
> will split only twice.
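(To spell that out with assumed numbers: with a 4k page and an 8k bio carried
in bvecs of 2k + 4k + 2k, splitting only on bvec boundaries produces
2k + 4k + 2k, i.e. three bios, while splitting mid-bvec at the full limit
produces 4k + 4k, i.e. only two.)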

IMO, splitting is quite cheap; alternatively, we can still increase the
queue_max_sectors() limit to the hardware-allowed value.

Comments

Keith Busch Jan. 6, 2016, 5:51 a.m. UTC | #1
On Wed, Jan 06, 2016 at 10:17:51AM +0800, Ming Lei wrote:
> Firstly, we never split a single bio vector before bio splitting was
> introduced.
>
> Secondly, the current bio split still doesn't support splitting a single
> bvec into two; it just makes the two bios share the original bvec table.
> Please see bio_split(), which calls bio_clone_fast() to do that, and the
> bvec table is immutable by that point.

You are saying we can't split a bio in the middle of a vector?
bvec_iter_advance() says we can split anywhere.
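(bvec_iter_advance(), paraphrased from the include/linux/bio.h of this era
rather than quoted verbatim, tracks partial progress inside a bvec via
bi_bvec_done and only moves bi_idx once a bvec is fully consumed:)

static inline void bvec_iter_advance(struct bio_vec *bv,
				     struct bvec_iter *iter, unsigned bytes)
{
	WARN_ONCE(bytes > iter->bi_size,
		  "Attempted to advance past end of bvec iter\n");

	while (bytes) {
		unsigned len = min(bytes, bvec_iter_len(bv, *iter));

		bytes -= len;
		iter->bi_size -= len;
		iter->bi_bvec_done += len;

		/* bi_idx only advances past a fully consumed bvec */
		if (iter->bi_bvec_done == __bvec_iter_bvec(bv, *iter)->bv_len) {
			iter->bi_bvec_done = 0;
			iter->bi_idx++;
		}
	}
}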
 
> I understand your motivation in the two patches. Actually, before bio
> splitting we didn't do sg merge for nvme because of the NO_SG_MERGE flag,
> which has been ignored since bio splitting was introduced. So could you
> check whether nvme performance is still good with NO_SG_MERGE honored
> again in blk_bio_segment_split()? The change should be simple, like the
> attached patch.

The nvme driver is okay to take physically merged pages. It splits them
into PRPs accordingly, and it's faster to DMA map physically contiguous
just because there are fewer segments to iterate, so NVMe would prefer
to let them coalesce.
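As a rough sketch of why (hypothetical code, not the actual nvme driver; the
function name is made up and first-page offset handling is omitted): one
physically contiguous segment simply turns into consecutive PAGE_SIZE PRP
entries, so coalescing cuts the number of segments to DMA map without
changing the PRP list itself.

/* hypothetical illustration only -- not drivers/nvme source */
static int nvme_prps_for_segment(__le64 *prp_list, dma_addr_t dma, u32 len)
{
	int n = 0;

	/* one PRP entry per PAGE_SIZE chunk of the contiguous segment */
	while (len) {
		u32 chunk = min_t(u32, len, (u32)PAGE_SIZE);

		prp_list[n++] = cpu_to_le64(dma);
		dma += chunk;
		len -= chunk;
	}

	return n;
}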
 
> IMO, splitting is quite cheap;

Splitting is absolutely not cheap if your media is fast enough. I measure
every % of increased CPU instructions as the same % decrease in IOPS
with 3D XPoint.

But this patch was to split on stripe boundaries, which is an even worse
penalty if we get the split wrong.

> +	bool no_sg_merge = !!test_bit(QUEUE_FLAG_NO_SG_MERGE, &q->queue_flags);
>  
>  	bio_for_each_segment(bv, bio, iter) {
> -		if (sectors + (bv.bv_len >> 9) > blk_max_size_offset(q, bio->bi_iter.bi_sector))
> +		if (no_sg_merge)
> +			goto new_segment;

Bad idea for NVMe. We need to split on SG gaps, which you've skipped,
or you will BUG_ON in the NVMe driver.
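(The gap rule in question is the virt_boundary check, roughly the following
in the 4.4-era include/linux/blkdev.h, paraphrased:)

static inline bool bvec_gap_to_prev(struct request_queue *q,
				    struct bio_vec *bprv, unsigned int offset)
{
	if (!queue_virt_boundary(q))
		return false;

	/*
	 * A gap exists if the next bvec doesn't start at offset 0 or the
	 * previous one doesn't end exactly on the boundary.
	 */
	return offset ||
		((bprv->bv_offset + bprv->bv_len) & queue_virt_boundary(q));
}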
Ming Lei Jan. 6, 2016, 6:29 a.m. UTC | #2
On Wed, Jan 6, 2016 at 1:51 PM, Keith Busch <keith.busch@intel.com> wrote:
> On Wed, Jan 06, 2016 at 10:17:51AM +0800, Ming Lei wrote:
>> Firstly, we never split a single bio vector before bio splitting was
>> introduced.
>>
>> Secondly, the current bio split still doesn't support splitting a single
>> bvec into two; it just makes the two bios share the original bvec table.
>> Please see bio_split(), which calls bio_clone_fast() to do that, and the
>> bvec table is immutable by that point.
>
> You are saying we can't split a bio in the middle of a vector?

I mean the current block stack may not be ready for that.

> bvec_iter_advance() says we can split anywhere.

bvec_iter_advance() can split anywhere, but the split bios may
cross one and the same bvec, which can cause trouble; for example,
BIOVEC_PHYS_MERGEABLE() may not work well, and
blk_phys_contig_segment() too.

Not to mention that your patch doesn't update front_seg_size/back_seg_size/
seg_size correctly.

That is why I said we should be careful about splitting a bvec: it has
never been done or supported before.

>
>> I understand your motivation in the two patches. Actually, before bio
>> splitting we didn't do sg merge for nvme because of the NO_SG_MERGE flag,
>> which has been ignored since bio splitting was introduced. So could you
>> check whether nvme performance is still good with NO_SG_MERGE honored
>> again in blk_bio_segment_split()? The change should be simple, like the
>> attached patch.
>
> The nvme driver is okay to take physically merged pages. It splits them
> into PRPs accordingly, and it's faster to DMA map physically contiguous
> just because there are fewer segments to iterate, so NVMe would prefer
> to let them coalesce.

If so, NO_SG_MERGE should be removed now.

>
>> IMO, splitting is quite cheap;
>
> Splitting is absolutely not cheap if your media is fast enough. I measure
> every % of increased CPU instructions as the same % decrease in IOPS
> with 3D XPoint.

Could you share your test case? Last time I used null_blk to observe
the performance difference between NO_SG_MERGE and non-NO_SG_MERGE,
and basically no difference was observed.

>
> But this patch was to split on stripe boundaries, which is an even worse
> penalty if we get the split wrong.
>
>> +     bool no_sg_merge = !!test_bit(QUEUE_FLAG_NO_SG_MERGE, &q->queue_flags);
>>
>>       bio_for_each_segment(bv, bio, iter) {
>> -             if (sectors + (bv.bv_len >> 9) > blk_max_size_offset(q, bio->bi_iter.bi_sector))
>> +             if (no_sg_merge)
>> +                     goto new_segment;
>
> Bad idea for NVMe. We need to split on SG gaps, which you've skipped,
> or you will BUG_ON in the NVMe driver.

I don't object to the idea of splitting on SG gaps; I just mean we should
make sure the implementation is correct.

Thanks,
Ming Lei
Ming Lei Jan. 6, 2016, 6:53 a.m. UTC | #3
On Wed, Jan 6, 2016 at 2:29 PM, Ming Lei <tom.leiming@gmail.com> wrote:
> On Wed, Jan 6, 2016 at 1:51 PM, Keith Busch <keith.busch@intel.com> wrote:
>> On Wed, Jan 06, 2016 at 10:17:51AM +0800, Ming Lei wrote:
>>> Firstly, we never split a single bio vector before bio splitting was
>>> introduced.
>>>
>>> Secondly, the current bio split still doesn't support splitting a single
>>> bvec into two; it just makes the two bios share the original bvec table.
>>> Please see bio_split(), which calls bio_clone_fast() to do that, and the
>>> bvec table is immutable by that point.
>>
>> You are saying we can't split a bio in the middle of a vector?
>
> I mean the current block stack may not be ready for that.
>
>> bvec_iter_advance() says we can split anywhere.
>
> bvec_iter_advance() can split anywhere, but the split bios may
> cross one and the same bvec, which can cause trouble; for example,
> BIOVEC_PHYS_MERGEABLE() may not work well, and
> blk_phys_contig_segment() too.

blk_rq_map_sg() isn't ready for splitting in the middle of a bvec either.

Thanks,
Keith Busch Jan. 6, 2016, 3:18 p.m. UTC | #4
On Wed, Jan 06, 2016 at 02:53:22PM +0800, Ming Lei wrote:
> On Wed, Jan 6, 2016 at 2:29 PM, Ming Lei <tom.leiming@gmail.com> wrote:
> >
> > bvec_iter_advance() can split anywhere, but the split bios may
> > cross one and the same bvec, which can cause trouble;

Cross? One starts where the other starts. Could you please explain what's
wrong?

> > for example,
> > BIOVEC_PHYS_MERGEABLE() may not work well, and
> > blk_phys_contig_segment() too.

Could you please explain why it may not work well? I have no idea what
concern you have with BIOVEC_PHYS_MERGEABLE in this case.

The only place blk_phys_contig_segment is called from is
ll_merge_requests, and the first req of the split bio wouldn't even go
through here because it's marked REQ_NOMERGE.

> blk_rq_map_sg() isn't ready for splitting in the middle of a bvec either.

Could you please explain? It just uses the bio iterator, which I'm pretty
sure is not broken.
Keith Busch Jan. 6, 2016, 3:29 p.m. UTC | #5
On Wed, Jan 06, 2016 at 02:29:16PM +0800, Ming Lei wrote:
> Not to mention that your patch doesn't update front_seg_size/back_seg_size/
> seg_size correctly.
> 
> That is why I said we should be careful about splitting a bvec: it has
> never been done or supported before.

Ok, that's an easy fix, but it's not really a functional difference. The
request is not mergeable after this, which is the only use for back and
front seg sizes.

> > The nvme driver is okay to take physically merged pages. It splits them
> > into PRPs accordingly, and it's faster to DMA map physically contiguous
> > just because there are fewer segments to iterate, so NVMe would prefer
> > to let them coalesce.
> 
> If so, NO_SG_MERGE should be removed now.
 
Sounds good.

> Could you share your test case? Last time I used null_blk to observe
> the performance difference between NO_SG_MERGE and non-NO_SG_MERGE,
> and basically no difference was observed.

null_blk doesn't do anything, so it's probably difficult to measure
that. Use a driver that DMA maps the data, uses MMIO to submit to h/w,
and takes an interrupt. You can measure all of these when you create 50%
more commands than necessary through unneeded splits.
Ming Lei Jan. 6, 2016, 3:43 p.m. UTC | #6
On Wed, Jan 6, 2016 at 11:18 PM, Keith Busch <keith.busch@intel.com> wrote:
> On Wed, Jan 06, 2016 at 02:53:22PM +0800, Ming Lei wrote:
>> On Wed, Jan 6, 2016 at 2:29 PM, Ming Lei <tom.leiming@gmail.com> wrote:
>> >
>> > bvec_iter_advance() can split anywhere, but the split bios may
>> > cross one and the same bvec, which can cause trouble;
>
> Cross? One starts where the other starts. Could you please explain what's
> wrong?

Suppose a bio is split into bio1 and bio2; then the last bvec of bio1 is the
same as the first bvec of bio2.

>
>> > for example,
>> > BIOVEC_PHYS_MERGEABLE() may not work well, and
>> > blk_phys_contig_segment() too.
>
> Could you please explain why it may not work well? I have no idea what
> concern you have with BIOVEC_PHYS_MERGEABLE in this case.
>
> The only place blk_phys_contig_segment is called from is
> ll_merge_requests, and the first req of the split bio wouldn't even go
> through here because it's marked REQ_NOMERGE.

OK, it looks like I forgot about this change.

>
>> blk_rq_map_sg() isn't ready for splitting in the middle of a bvec either.
>
> Could you please explain? It just uses the bio iterator, which I'm pretty
> sure is not broken.

Please see the first line of code in __blk_segment_map_sg(), where only a
whole bvec is handled; a partial bvec can't be expressed there.

Thinking about it further, drivers often use bv.bv_len directly in the
iterator, for example:

        bio_for_each_segment(bvec, bio, iter)
                memcpy(page_address(bvec.bv_page) + bvec.bv_offset,
                       addr + offset, bvec.bv_len);

So your patch will break these drivers, won't it?
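(For reference, the helper in question, paraphrased from the 4.4-era
block/blk-merge.c and trimmed, not verbatim; nbytes is taken from whatever
bvec the iterator hands in:)

static inline void
__blk_segment_map_sg(struct request_queue *q, struct bio_vec *bvec,
		     struct scatterlist *sglist, struct bio_vec *bvprv,
		     struct scatterlist **sg, int *nsegs, int *cluster)
{
	int nbytes = bvec->bv_len;	/* length of the bvec as passed in */

	if (*sg && *cluster) {
		if ((*sg)->length + nbytes > queue_max_segment_size(q))
			goto new_segment;
		if (!BIOVEC_PHYS_MERGEABLE(bvprv, bvec))
			goto new_segment;
		if (!BIOVEC_SEG_BOUNDARY(q, bvprv, bvec))
			goto new_segment;

		(*sg)->length += nbytes;	/* physically merge */
	} else {
new_segment:
		if (!*sg)
			*sg = sglist;
		else {
			sg_unmark_end(*sg);
			*sg = sg_next(*sg);
		}

		sg_set_page(*sg, bvec->bv_page, nbytes, bvec->bv_offset);
		(*nsegs)++;
	}
	*bvprv = *bvec;
}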
Keith Busch Jan. 6, 2016, 4:21 p.m. UTC | #7
On Wed, Jan 06, 2016 at 11:43:45PM +0800, Ming Lei wrote:
> Please see the first line of code in __blk_segment_map_sg(), where only a
> whole bvec is handled; a partial bvec can't be expressed there.
> 
> Thinking about it further, drivers often use bv.bv_len directly in the
> iterator, for example:
> 
>         bio_for_each_segment(bvec, bio, iter)
>                 memcpy(page_address(bvec.bv_page) + bvec.bv_offset,
>                        addr + offset, bvec.bv_len);
> 
> So your patch will break these drivers, won't it?

CC'ing Kent in hopes he will clarify what happens on a split.

The bio_advance() code comments say it's handled:

"
 * This updates bi_sector, bi_size and bi_idx; if the number of bytes to
 * complete doesn't align with a bvec boundary, then bv_len and bv_offset will
 * be updated on the last bvec as well.
"

I admit I'm having a hard time seeing where bv_len and bv_offset are updated
in this path. It was obviously handled after 054bdf646e, then changed
with 4550dd6c6b.

If I follow correctly, 4550dd6c6b will implicitly update the bvec's offset
and length during the split here, since bio_iter_iovec recomputes the bvec's
length and offset:
---
#define __bio_for_each_segment(bvl, bio, iter, start)			\
	for (iter = (start);						\
	     (iter).bi_size &&						\
		((bvl = bio_iter_iovec((bio), (iter))), 1);		\
	     bio_advance_iter((bio), &(iter), (bvl).bv_len))
--
Ming Lei Jan. 7, 2016, 12:14 a.m. UTC | #8
On Thu, Jan 7, 2016 at 12:21 AM, Keith Busch <keith.busch@intel.com> wrote:
> On Wed, Jan 06, 2016 at 11:43:45PM +0800, Ming Lei wrote:
>> Please see the first line of code in __blk_segment_map_sg(), where only a
>> whole bvec is handled; a partial bvec can't be expressed there.
>>
>> Thinking about it further, drivers often use bv.bv_len directly in the
>> iterator, for example:
>>
>>         bio_for_each_segment(bvec, bio, iter)
>>                 memcpy(page_address(bvec.bv_page) + bvec.bv_offset,
>>                        addr + offset, bvec.bv_len);
>>
>> So your patch will break these drivers, won't it?
>
> CC'ing Kent in hopes he will clarify what happens on a split.
>
> The bio_advance() code comments say it's handled:
>
> "
>  * This updates bi_sector, bi_size and bi_idx; if the number of bytes to
>  * complete doesn't align with a bvec boundary, then bv_len and bv_offset will
>  * be updated on the last bvec as well.
> "

One exception is that both can be reset in bio_clone_bioset().

>
> I admit I'm having a hard time seeing where bv_len and bv_offset are updated
> in this path. It was obviously handled after 054bdf646e, then changed
> with 4550dd6c6b.
>
> If I follow correctly, 4550dd6c6b will implicitly update the bvec's offset
> and length during the split here, since bio_iter_iovec recomputes the bvec's
> length and offset:
> ---
> #define __bio_for_each_segment(bvl, bio, iter, start)                   \
>         for (iter = (start);                                            \
>              (iter).bi_size &&                                          \
>                 ((bvl = bio_iter_iovec((bio), (iter))), 1);             \
>              bio_advance_iter((bio), &(iter), (bvl).bv_len))
> --
Kent Overstreet Jan. 7, 2016, 10:46 a.m. UTC | #9
On Wed, Jan 06, 2016 at 04:21:17PM +0000, Keith Busch wrote:
> On Wed, Jan 06, 2016 at 11:43:45PM +0800, Ming Lei wrote:
> > Please see the first line of code in __blk_segment_map_sg(), where only a
> > whole bvec is handled; a partial bvec can't be expressed there.
> > 
> > Thinking about it further, drivers often use bv.bv_len directly in the
> > iterator, for example:
> > 
> >         bio_for_each_segment(bvec, bio, iter)
> >                 memcpy(page_address(bvec.bv_page) + bvec.bv_offset,
> >                        addr + offset, bvec.bv_len);
> > 
> > So your patch will break these drivers, won't it?
> 
> CC'ing Kent in hopes he will clarify what happens on a split.
> 
> The bio_advance() code comments say it's handled:
> 
> "
>  * This updates bi_sector, bi_size and bi_idx; if the number of bytes to
>  * complete doesn't align with a bvec boundary, then bv_len and bv_offset will
>  * be updated on the last bvec as well.
> "
> 
> I admit I'm having a hard time seeing where bv_len and bv_offset are updated
> in this path. It was obviously handled after 054bdf646e, then changed
> with 4550dd6c6b.
> 
> If I follow correctly, 4550dd6c6b will implicitly update the bvec's offset
> and length during the split here, since bio_iter_iovec recomputes the bvec's
> length and offset:
> ---
> #define __bio_for_each_segment(bvl, bio, iter, start)			\
> 	for (iter = (start);						\
> 	     (iter).bi_size &&						\
> 		((bvl = bio_iter_iovec((bio), (iter))), 1);		\
> 	     bio_advance_iter((bio), &(iter), (bvl).bv_len))
> --

Yes, splitting in the middle of a bvec is perfectly fine. The reason
bio_for_each_segment takes a struct bvec and not a struct bvec * is that it's
computing what bv_len should be (taking the min of bv_len and bi_size, roughly).

See include/linux/bio.h:

bio_for_each_segment()
  bio_iter_iovec()
    bvec_iter_bvec()
      bvec_iter_len()

which does the actual bv_len computation.
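(Condensed from the include/linux/bio.h of that era -- a paraphrase, not a
verbatim quote:)

#define __bvec_iter_bvec(bvec, iter)	(&(bvec)[(iter).bi_idx])

/* clip the current bvec to what's left of the iterator */
#define bvec_iter_len(bvec, iter)				\
	min((iter).bi_size,					\
	    __bvec_iter_bvec((bvec), (iter))->bv_len - (iter).bi_bvec_done)

#define bvec_iter_offset(bvec, iter)				\
	(__bvec_iter_bvec((bvec), (iter))->bv_offset + (iter).bi_bvec_done)

#define bvec_iter_bvec(bvec, iter)				\
((struct bio_vec) {						\
	.bv_page	= __bvec_iter_bvec((bvec), (iter))->bv_page,	\
	.bv_len		= bvec_iter_len((bvec), (iter)),	\
	.bv_offset	= bvec_iter_offset((bvec), (iter)),	\
})

So even a bio split mid-bvec hands each consumer a bvec whose bv_offset and
bv_len describe only its own portion.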
Ming Lei Jan. 9, 2016, 11:10 a.m. UTC | #10
On Thu, Jan 7, 2016 at 6:46 PM, Kent Overstreet
<kent.overstreet@gmail.com> wrote:
> On Wed, Jan 06, 2016 at 04:21:17PM +0000, Keith Busch wrote:
...
> Yes, splitting in the middle of a bvec is perfectly fine. The reason
> bio_for_each_segment takes a struct bvec and not a struct bvec * is that it's
> computing what bv_len should be (taking the min of bv_len and bi_size, roughly).
>
> See include/linux/bio.h:
>
> bio_for_each_segment()
>   bio_iter_iovec()
>     bvec_iter_bvec()
>       bvec_iter_len()
>
> which does the actual bv_len computation.

Looks like my understanding was wrong and Keith's patch is correct; sorry
for the noise.

Patch

diff --git a/block/blk-merge.c b/block/blk-merge.c
index e73846a..64fbbba 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -79,9 +79,13 @@  static struct bio *blk_bio_segment_split(struct request_queue *q,
 	unsigned front_seg_size = bio->bi_seg_front_size;
 	bool do_split = true;
 	struct bio *new = NULL;
+	bool no_sg_merge = !!test_bit(QUEUE_FLAG_NO_SG_MERGE, &q->queue_flags);
 
 	bio_for_each_segment(bv, bio, iter) {
-		if (sectors + (bv.bv_len >> 9) > blk_max_size_offset(q, bio->bi_iter.bi_sector))
+		if (no_sg_merge)
+			goto new_segment;
+
+		if (sectors + (bv.bv_len >> 9) > queue_max_sectors(q))
 			goto split;
 
 		/*