
[V4] block: optimize for small block size IO

Message ID 20191102072911.24817-1-ming.lei@redhat.com (mailing list archive)
State New, archived
Series: [V4] block: optimize for small block size IO

Commit Message

Ming Lei Nov. 2, 2019, 7:29 a.m. UTC
__blk_queue_split() may be a bit heavy for small block size (such as
512B or 4KB) IO, so introduce one flag that records whether this bio
includes multiple pages, and only consider splitting the bio when that
multi-page flag is set.

A ~3% - 5% IOPS improvement can be observed in an io_uring test over
null_blk (MQ); the io_uring test code is from fio/t/io_uring.c.

bch_bio_map() should be the only builder which doesn't use bio_add_page(),
so force-mark bios built via bch_bio_map() as MULTI_PAGE.

RAID5 has similar usage too, but its bio really is a single-page bio,
so it isn't necessary to handle it.

Cc: Coly Li <colyli@suse.de>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Keith Busch <kbusch@kernel.org>
Cc: linux-bcache@vger.kernel.org
Acked-by: Coly Li <colyli@suse.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
V4:
	- patch style change as suggested by Jens
V3:
	- simplify check in __bio_add_page() as suggested by Christoph
V2:
	- share bit flag with passthrough IO
	- deal with adding multipage in one bio_add_page()

 block/bio.c               | 9 +++++++++
 block/blk-merge.c         | 8 +++++++-
 block/bounce.c            | 3 +++
 drivers/md/bcache/util.c  | 2 ++
 include/linux/blk_types.h | 3 +++
 5 files changed, 24 insertions(+), 1 deletion(-)

Comments

Jens Axboe Nov. 2, 2019, 2:03 p.m. UTC | #1
On 11/2/19 1:29 AM, Ming Lei wrote:
> __blk_queue_split() may be a bit heavy for small block size(such as
> 512B, or 4KB) IO, so introduce one flag to decide if this bio includes
> multiple page. And only consider to try splitting this bio in case
> that the multiple page flag is set.
> 
> ~3% - 5% IOPS improvement can be observed on io_uring test over
> null_blk(MQ), and the io_uring test code is from fio/t/io_uring.c
> 
> bch_bio_map() should be the only one which doesn't use bio_add_page(),
> so force to mark bio built via bch_bio_map() as MULTI_PAGE.
> 
> RAID5 has similar usage too, however the bio is really single-page bio,
> so not necessary to handle it.

Thanks Ming, applied.
Jens Axboe Nov. 2, 2019, 3:57 p.m. UTC | #2
On 11/2/19 8:03 AM, Jens Axboe wrote:
> On 11/2/19 1:29 AM, Ming Lei wrote:
>> __blk_queue_split() may be a bit heavy for small block size(such as
>> 512B, or 4KB) IO, so introduce one flag to decide if this bio includes
>> multiple page. And only consider to try splitting this bio in case
>> that the multiple page flag is set.
>>
>> ~3% - 5% IOPS improvement can be observed on io_uring test over
>> null_blk(MQ), and the io_uring test code is from fio/t/io_uring.c
>>
>> bch_bio_map() should be the only one which doesn't use bio_add_page(),
>> so force to mark bio built via bch_bio_map() as MULTI_PAGE.
>>
>> RAID5 has similar usage too, however the bio is really single-page bio,
>> so not necessary to handle it.
> 
> Thanks Ming, applied.

Actually, I took a closer look at this. I thought the BIO_MAP_USER
overload would be ok, but that seems potentially fragile and so does
the fact that we need to now maintain an extra state for multipage.
Any serious objections to just doing the somewhat hacky bio->bi_vcnt
check? With a comment I think that's more acceptable, and it doesn't
rely on maintaining extra state. Particularly the latter is a big
win, imho.
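
A minimal sketch of the bio->bi_vcnt idea, based on the __blk_queue_split()
hunk in the patch below; the helper name is made up and this is an
illustration, not the change that was eventually merged:

/*
 * A bio whose single bvec fits within one page can never need splitting,
 * so the default case in __blk_queue_split() could skip
 * blk_bio_segment_split() without tracking a separate BIO_MULTI_PAGE flag.
 */
static inline bool bio_may_need_split(struct bio *bio)
{
	return bio->bi_vcnt != 1 ||
	       bio->bi_io_vec[0].bv_offset + bio->bi_io_vec[0].bv_len >
			PAGE_SIZE;
}

/* and in the default case of __blk_queue_split(): */
	default:
		if (!bio_may_need_split(*bio)) {
			*nr_segs = 1;	/* single-page bio: one segment */
			split = NULL;
		} else {
			split = blk_bio_segment_split(q, *bio, &q->bio_split,
					nr_segs);
		}
		break;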
Ming Lei Nov. 4, 2019, 12:01 a.m. UTC | #3
On Sat, Nov 2, 2019 at 11:58 PM Jens Axboe <axboe@kernel.dk> wrote:
>
> On 11/2/19 8:03 AM, Jens Axboe wrote:
> > On 11/2/19 1:29 AM, Ming Lei wrote:
> >> __blk_queue_split() may be a bit heavy for small block size(such as
> >> 512B, or 4KB) IO, so introduce one flag to decide if this bio includes
> >> multiple page. And only consider to try splitting this bio in case
> >> that the multiple page flag is set.
> >>
> >> ~3% - 5% IOPS improvement can be observed on io_uring test over
> >> null_blk(MQ), and the io_uring test code is from fio/t/io_uring.c
> >>
> >> bch_bio_map() should be the only one which doesn't use bio_add_page(),
> >> so force to mark bio built via bch_bio_map() as MULTI_PAGE.
> >>
> >> RAID5 has similar usage too, however the bio is really single-page bio,
> >> so not necessary to handle it.
> >
> > Thanks Ming, applied.
>
> Actually, I took a closer look at this. I thought the BIO_MAP_USER
> overload would be ok, but that seems potentially fragile and so does
> the fact that we need to now maintain an extra state for multipage.
> Any serious objections to just doing the somewhat hacky bio->bi_vcnt
> check? With a comment I think that's more acceptable, and it doesn't
> rely on maintaining extra state. Particularly the latter is a big
> win, imho.

I am fine with checking bio->bi_vcnt with a comment.

Thanks,
Ming Lei
Kent Overstreet Nov. 4, 2019, 6:14 p.m. UTC | #4
On Sat, Nov 02, 2019 at 03:29:11PM +0800, Ming Lei wrote:
> __blk_queue_split() may be a bit heavy for small block size(such as
> 512B, or 4KB) IO, so introduce one flag to decide if this bio includes
> multiple page. And only consider to try splitting this bio in case
> that the multiple page flag is set.

So, back in the day I had an alternative approach in mind: get rid of
blk_queue_split entirely, by pushing splitting down to the request layer - when
we map the bio/request to sgl, just have it map as much as will fit in the sgl
and if it doesn't entirely fit bump bi_remaining and leave it on the request
queue.

This would mean there'd be no need for counting segments at all, and would cut a
fair amount of code out of the io path.
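
A rough sketch of that idea; rq_map_sg_partial() and the resume cursor are
invented names, and queue limits (segment size, boundary masks) are ignored.
It only illustrates mapping until the sg table is full and remembering where
to pick up again:

#include <linux/blk-mq.h>
#include <linux/scatterlist.h>

/* Map bvecs into sglist until it is full; report where the remainder starts. */
static int rq_map_sg_partial(struct request *rq, struct scatterlist *sglist,
			     int max_segs, struct req_iterator *resume)
{
	struct req_iterator iter;
	struct scatterlist *sg = sglist;
	struct bio_vec bv;
	int nsegs = 0;

	rq_for_each_bvec(bv, rq, iter) {
		if (nsegs == max_segs) {
			*resume = iter;		/* remainder starts here */
			return nsegs;
		}
		sg_set_page(sg, bv.bv_page, bv.bv_len, bv.bv_offset);
		sg = sg_next(sg);
		nsegs++;
	}

	resume->bio = NULL;			/* everything fit */
	return nsegs;
}

The driver would issue what was mapped and come back for the rest, which is
where the partial-completion bookkeeping discussed later in the thread comes in.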
Christoph Hellwig Nov. 4, 2019, 6:15 p.m. UTC | #5
On Mon, Nov 04, 2019 at 01:14:03PM -0500, Kent Overstreet wrote:
> On Sat, Nov 02, 2019 at 03:29:11PM +0800, Ming Lei wrote:
> > __blk_queue_split() may be a bit heavy for small block size(such as
> > 512B, or 4KB) IO, so introduce one flag to decide if this bio includes
> > multiple page. And only consider to try splitting this bio in case
> > that the multiple page flag is set.
> 
> So, back in the day I had an alternative approach in mind: get rid of
> blk_queue_split entirely, by pushing splitting down to the request layer - when
> we map the bio/request to sgl, just have it map as much as will fit in the sgl
> and if it doesn't entirely fit bump bi_remaining and leave it on the request
> queue.
> 
> This would mean there'd be no need for counting segments at all, and would cut a
> fair amount of code out of the io path.

I thought about that too, but it will take a lot more effort.  Mostly
because md/dm heavily rely on splitting as well.  I still think it is
worthwhile; it will just take a significant amount of time, and we
should have the quick improvement now.
Kent Overstreet Nov. 4, 2019, 6:17 p.m. UTC | #6
On Mon, Nov 04, 2019 at 10:15:41AM -0800, Christoph Hellwig wrote:
> On Mon, Nov 04, 2019 at 01:14:03PM -0500, Kent Overstreet wrote:
> > On Sat, Nov 02, 2019 at 03:29:11PM +0800, Ming Lei wrote:
> > > __blk_queue_split() may be a bit heavy for small block size(such as
> > > 512B, or 4KB) IO, so introduce one flag to decide if this bio includes
> > > multiple page. And only consider to try splitting this bio in case
> > > that the multiple page flag is set.
> > 
> > So, back in the day I had an alternative approach in mind: get rid of
> > blk_queue_split entirely, by pushing splitting down to the request layer - when
> > we map the bio/request to sgl, just have it map as much as will fit in the sgl
> > and if it doesn't entirely fit bump bi_remaining and leave it on the request
> > queue.
> > 
> > This would mean there'd be no need for counting segments at all, and would cut a
> > fair amount of code out of the io path.
> 
> I thought about that to, but it will take a lot more effort.  Mostly
> because md/dm heavily rely on splitting as well.  I still think it is
> worthwhile, it will just take a significant amount of time and we
> should have the quick improvement now.

We can do it one driver at a time - driver sets a flag to disable
blk_queue_split(). Obvious one to do first would be nvme since that's where it
shows up the most.

And md/dm do splitting internally, but I'm not so sure they need
blk_queue_split().
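
Purely as an illustration (QUEUE_FLAG_NO_SPLIT does not exist and the bit
number is arbitrary), the per-driver opt-out could be as small as:

#define QUEUE_FLAG_NO_SPLIT	29	/* driver maps/splits on its own */

static inline bool blk_queue_needs_split(struct request_queue *q)
{
	return !test_bit(QUEUE_FLAG_NO_SPLIT, &q->queue_flags);
}

A converted driver would call blk_queue_flag_set(QUEUE_FLAG_NO_SPLIT, q) at
init time, and blk_queue_split() would return early when the flag is set.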
Jens Axboe Nov. 4, 2019, 6:23 p.m. UTC | #7
On 11/4/19 11:17 AM, Kent Overstreet wrote:
> On Mon, Nov 04, 2019 at 10:15:41AM -0800, Christoph Hellwig wrote:
>> On Mon, Nov 04, 2019 at 01:14:03PM -0500, Kent Overstreet wrote:
>>> On Sat, Nov 02, 2019 at 03:29:11PM +0800, Ming Lei wrote:
>>>> __blk_queue_split() may be a bit heavy for small block size(such as
>>>> 512B, or 4KB) IO, so introduce one flag to decide if this bio includes
>>>> multiple page. And only consider to try splitting this bio in case
>>>> that the multiple page flag is set.
>>>
>>> So, back in the day I had an alternative approach in mind: get rid of
>>> blk_queue_split entirely, by pushing splitting down to the request layer - when
>>> we map the bio/request to sgl, just have it map as much as will fit in the sgl
>>> and if it doesn't entirely fit bump bi_remaining and leave it on the request
>>> queue.
>>>
>>> This would mean there'd be no need for counting segments at all, and would cut a
>>> fair amount of code out of the io path.
>>
>> I thought about that to, but it will take a lot more effort.  Mostly
>> because md/dm heavily rely on splitting as well.  I still think it is
>> worthwhile, it will just take a significant amount of time and we
>> should have the quick improvement now.
> 
> We can do it one driver at a time - driver sets a flag to disable
> blk_queue_split(). Obvious one to do first would be nvme since that's where it
> shows up the most.
> 
> And md/md do splitting internally, but I'm not so sure they need
> blk_queue_split().

I'm a big proponent of doing something like that instead, but it is a
lot of work. I absolutely hate the splitting we're doing now, even
though the original "let's work as hard as we can at add-page time to
get things right" approach was pretty abysmal as well.
Kent Overstreet Nov. 4, 2019, 6:42 p.m. UTC | #8
On Mon, Nov 04, 2019 at 11:23:42AM -0700, Jens Axboe wrote:
> On 11/4/19 11:17 AM, Kent Overstreet wrote:
> > On Mon, Nov 04, 2019 at 10:15:41AM -0800, Christoph Hellwig wrote:
> >> On Mon, Nov 04, 2019 at 01:14:03PM -0500, Kent Overstreet wrote:
> >>> On Sat, Nov 02, 2019 at 03:29:11PM +0800, Ming Lei wrote:
> >>>> __blk_queue_split() may be a bit heavy for small block size(such as
> >>>> 512B, or 4KB) IO, so introduce one flag to decide if this bio includes
> >>>> multiple page. And only consider to try splitting this bio in case
> >>>> that the multiple page flag is set.
> >>>
> >>> So, back in the day I had an alternative approach in mind: get rid of
> >>> blk_queue_split entirely, by pushing splitting down to the request layer - when
> >>> we map the bio/request to sgl, just have it map as much as will fit in the sgl
> >>> and if it doesn't entirely fit bump bi_remaining and leave it on the request
> >>> queue.
> >>>
> >>> This would mean there'd be no need for counting segments at all, and would cut a
> >>> fair amount of code out of the io path.
> >>
> >> I thought about that to, but it will take a lot more effort.  Mostly
> >> because md/dm heavily rely on splitting as well.  I still think it is
> >> worthwhile, it will just take a significant amount of time and we
> >> should have the quick improvement now.
> > 
> > We can do it one driver at a time - driver sets a flag to disable
> > blk_queue_split(). Obvious one to do first would be nvme since that's where it
> > shows up the most.
> > 
> > And md/md do splitting internally, but I'm not so sure they need
> > blk_queue_split().
> 
> I'm a big proponent of doing something like that instead, but it is a
> lot of work. I absolutely hate the splitting we're doing now, even
> though the original "let's work as hard as we add add page time to get
> things right" was pretty abysmal as well.

Last I looked I don't think it was going to be that bad, just needed a bit of
finesse. We just need to be able to partially process a request in e.g.
nvme_map_data(), and blk_rq_map_sg() needs to be modified to only map as much as
will fit instead of popping an assertion.
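
Continuing the earlier sketch (rq_map_sg_partial() and the resume cursor are
invented; this is not how nvme_map_data() works today), the driver-side
mapping might look roughly like:

static blk_status_t driver_map_data_partial(struct request *rq,
					    struct scatterlist *sglist,
					    int max_segs,
					    struct req_iterator *resume)
{
	int nsegs = rq_map_sg_partial(rq, sglist, max_segs, resume);

	if (!nsegs)
		return BLK_STS_IOERR;

	/*
	 * Build the device-specific descriptors (PRPs/SGLs for NVMe) from
	 * sglist[0..nsegs) here.  If resume->bio is non-NULL, the request
	 * was only partially mapped and the remainder stays on the queue
	 * until this part completes.
	 */
	return BLK_STS_OK;
}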
Ming Lei Nov. 5, 2019, 12:44 a.m. UTC | #9
On Mon, Nov 04, 2019 at 01:14:03PM -0500, Kent Overstreet wrote:
> On Sat, Nov 02, 2019 at 03:29:11PM +0800, Ming Lei wrote:
> > __blk_queue_split() may be a bit heavy for small block size(such as
> > 512B, or 4KB) IO, so introduce one flag to decide if this bio includes
> > multiple page. And only consider to try splitting this bio in case
> > that the multiple page flag is set.
> 
> So, back in the day I had an alternative approach in mind: get rid of
> blk_queue_split entirely, by pushing splitting down to the request layer - when
> we map the bio/request to sgl, just have it map as much as will fit in the sgl
> and if it doesn't entirely fit bump bi_remaining and leave it on the request
> queue.

Many drivers don't need an sgl via blk_rq_map_sg(), but they still need to split bios.

> 
> This would mean there'd be no need for counting segments at all, and would cut a
> fair amount of code out of the io path.

No segment counting is involved in this small block size case, and the
handling in blk_bio_segment_split() should be simple enough, so I still
don't understand why the IOPS drop is observable.

Thanks,
Ming
Ming Lei Nov. 5, 2019, 1:11 a.m. UTC | #10
On Mon, Nov 04, 2019 at 01:42:17PM -0500, Kent Overstreet wrote:
> On Mon, Nov 04, 2019 at 11:23:42AM -0700, Jens Axboe wrote:
> > On 11/4/19 11:17 AM, Kent Overstreet wrote:
> > > On Mon, Nov 04, 2019 at 10:15:41AM -0800, Christoph Hellwig wrote:
> > >> On Mon, Nov 04, 2019 at 01:14:03PM -0500, Kent Overstreet wrote:
> > >>> On Sat, Nov 02, 2019 at 03:29:11PM +0800, Ming Lei wrote:
> > >>>> __blk_queue_split() may be a bit heavy for small block size(such as
> > >>>> 512B, or 4KB) IO, so introduce one flag to decide if this bio includes
> > >>>> multiple page. And only consider to try splitting this bio in case
> > >>>> that the multiple page flag is set.
> > >>>
> > >>> So, back in the day I had an alternative approach in mind: get rid of
> > >>> blk_queue_split entirely, by pushing splitting down to the request layer - when
> > >>> we map the bio/request to sgl, just have it map as much as will fit in the sgl
> > >>> and if it doesn't entirely fit bump bi_remaining and leave it on the request
> > >>> queue.
> > >>>
> > >>> This would mean there'd be no need for counting segments at all, and would cut a
> > >>> fair amount of code out of the io path.
> > >>
> > >> I thought about that to, but it will take a lot more effort.  Mostly
> > >> because md/dm heavily rely on splitting as well.  I still think it is
> > >> worthwhile, it will just take a significant amount of time and we
> > >> should have the quick improvement now.
> > > 
> > > We can do it one driver at a time - driver sets a flag to disable
> > > blk_queue_split(). Obvious one to do first would be nvme since that's where it
> > > shows up the most.
> > > 
> > > And md/md do splitting internally, but I'm not so sure they need
> > > blk_queue_split().
> > 
> > I'm a big proponent of doing something like that instead, but it is a
> > lot of work. I absolutely hate the splitting we're doing now, even
> > though the original "let's work as hard as we add add page time to get
> > things right" was pretty abysmal as well.
> 
> Last I looked I don't think it was going to be that bad, just needed a bit of
> finesse. We just need to be able to partially process a request in e.g.
> nvme_map_data(), and blk_rq_map_sg() needs to be modified to only map as much as
> will fit instead of popping an assertion.

I think it may not be doable.

blk_rq_map_sg() is called by drivers and has to work on a single request,
but more requests would have to be involved if we delay the splitting to
blk_rq_map_sg(), because splitting means that two bios can't be submitted
in a single IO request.


Thanks,
Ming
Kent Overstreet Nov. 5, 2019, 2:11 a.m. UTC | #11
On Tue, Nov 05, 2019 at 09:11:35AM +0800, Ming Lei wrote:
> On Mon, Nov 04, 2019 at 01:42:17PM -0500, Kent Overstreet wrote:
> > On Mon, Nov 04, 2019 at 11:23:42AM -0700, Jens Axboe wrote:
> > > On 11/4/19 11:17 AM, Kent Overstreet wrote:
> > > > On Mon, Nov 04, 2019 at 10:15:41AM -0800, Christoph Hellwig wrote:
> > > >> On Mon, Nov 04, 2019 at 01:14:03PM -0500, Kent Overstreet wrote:
> > > >>> On Sat, Nov 02, 2019 at 03:29:11PM +0800, Ming Lei wrote:
> > > >>>> __blk_queue_split() may be a bit heavy for small block size(such as
> > > >>>> 512B, or 4KB) IO, so introduce one flag to decide if this bio includes
> > > >>>> multiple page. And only consider to try splitting this bio in case
> > > >>>> that the multiple page flag is set.
> > > >>>
> > > >>> So, back in the day I had an alternative approach in mind: get rid of
> > > >>> blk_queue_split entirely, by pushing splitting down to the request layer - when
> > > >>> we map the bio/request to sgl, just have it map as much as will fit in the sgl
> > > >>> and if it doesn't entirely fit bump bi_remaining and leave it on the request
> > > >>> queue.
> > > >>>
> > > >>> This would mean there'd be no need for counting segments at all, and would cut a
> > > >>> fair amount of code out of the io path.
> > > >>
> > > >> I thought about that to, but it will take a lot more effort.  Mostly
> > > >> because md/dm heavily rely on splitting as well.  I still think it is
> > > >> worthwhile, it will just take a significant amount of time and we
> > > >> should have the quick improvement now.
> > > > 
> > > > We can do it one driver at a time - driver sets a flag to disable
> > > > blk_queue_split(). Obvious one to do first would be nvme since that's where it
> > > > shows up the most.
> > > > 
> > > > And md/md do splitting internally, but I'm not so sure they need
> > > > blk_queue_split().
> > > 
> > > I'm a big proponent of doing something like that instead, but it is a
> > > lot of work. I absolutely hate the splitting we're doing now, even
> > > though the original "let's work as hard as we add add page time to get
> > > things right" was pretty abysmal as well.
> > 
> > Last I looked I don't think it was going to be that bad, just needed a bit of
> > finesse. We just need to be able to partially process a request in e.g.
> > nvme_map_data(), and blk_rq_map_sg() needs to be modified to only map as much as
> > will fit instead of popping an assertion.
> 
> I think it may not be doable.
> 
> blk_rq_map_sg() is called by drivers and has to work on single request, however
> more requests have to be involved if we delay the splitting to blk_rq_map_sg().
> Cause splitting means that two bios can't be submitted in single IO request.

Of course it's doable, do I have to show you how?
Ming Lei Nov. 5, 2019, 2:20 a.m. UTC | #12
On Mon, Nov 04, 2019 at 09:11:30PM -0500, Kent Overstreet wrote:
> On Tue, Nov 05, 2019 at 09:11:35AM +0800, Ming Lei wrote:
> > On Mon, Nov 04, 2019 at 01:42:17PM -0500, Kent Overstreet wrote:
> > > On Mon, Nov 04, 2019 at 11:23:42AM -0700, Jens Axboe wrote:
> > > > On 11/4/19 11:17 AM, Kent Overstreet wrote:
> > > > > On Mon, Nov 04, 2019 at 10:15:41AM -0800, Christoph Hellwig wrote:
> > > > >> On Mon, Nov 04, 2019 at 01:14:03PM -0500, Kent Overstreet wrote:
> > > > >>> On Sat, Nov 02, 2019 at 03:29:11PM +0800, Ming Lei wrote:
> > > > >>>> __blk_queue_split() may be a bit heavy for small block size(such as
> > > > >>>> 512B, or 4KB) IO, so introduce one flag to decide if this bio includes
> > > > >>>> multiple page. And only consider to try splitting this bio in case
> > > > >>>> that the multiple page flag is set.
> > > > >>>
> > > > >>> So, back in the day I had an alternative approach in mind: get rid of
> > > > >>> blk_queue_split entirely, by pushing splitting down to the request layer - when
> > > > >>> we map the bio/request to sgl, just have it map as much as will fit in the sgl
> > > > >>> and if it doesn't entirely fit bump bi_remaining and leave it on the request
> > > > >>> queue.
> > > > >>>
> > > > >>> This would mean there'd be no need for counting segments at all, and would cut a
> > > > >>> fair amount of code out of the io path.
> > > > >>
> > > > >> I thought about that to, but it will take a lot more effort.  Mostly
> > > > >> because md/dm heavily rely on splitting as well.  I still think it is
> > > > >> worthwhile, it will just take a significant amount of time and we
> > > > >> should have the quick improvement now.
> > > > > 
> > > > > We can do it one driver at a time - driver sets a flag to disable
> > > > > blk_queue_split(). Obvious one to do first would be nvme since that's where it
> > > > > shows up the most.
> > > > > 
> > > > > And md/md do splitting internally, but I'm not so sure they need
> > > > > blk_queue_split().
> > > > 
> > > > I'm a big proponent of doing something like that instead, but it is a
> > > > lot of work. I absolutely hate the splitting we're doing now, even
> > > > though the original "let's work as hard as we add add page time to get
> > > > things right" was pretty abysmal as well.
> > > 
> > > Last I looked I don't think it was going to be that bad, just needed a bit of
> > > finesse. We just need to be able to partially process a request in e.g.
> > > nvme_map_data(), and blk_rq_map_sg() needs to be modified to only map as much as
> > > will fit instead of popping an assertion.
> > 
> > I think it may not be doable.
> > 
> > blk_rq_map_sg() is called by drivers and has to work on single request, however
> > more requests have to be involved if we delay the splitting to blk_rq_map_sg().
> > Cause splitting means that two bios can't be submitted in single IO request.
> 
> Of course it's doable, do I have to show you how?

No, you don't have to - could you just point out where my words above are wrong?


Thanks,
Ming
Kent Overstreet Nov. 5, 2019, 2:30 a.m. UTC | #13
On Tue, Nov 05, 2019 at 10:20:46AM +0800, Ming Lei wrote:
> On Mon, Nov 04, 2019 at 09:11:30PM -0500, Kent Overstreet wrote:
> > On Tue, Nov 05, 2019 at 09:11:35AM +0800, Ming Lei wrote:
> > > On Mon, Nov 04, 2019 at 01:42:17PM -0500, Kent Overstreet wrote:
> > > > On Mon, Nov 04, 2019 at 11:23:42AM -0700, Jens Axboe wrote:
> > > > > On 11/4/19 11:17 AM, Kent Overstreet wrote:
> > > > > > On Mon, Nov 04, 2019 at 10:15:41AM -0800, Christoph Hellwig wrote:
> > > > > >> On Mon, Nov 04, 2019 at 01:14:03PM -0500, Kent Overstreet wrote:
> > > > > >>> On Sat, Nov 02, 2019 at 03:29:11PM +0800, Ming Lei wrote:
> > > > > >>>> __blk_queue_split() may be a bit heavy for small block size(such as
> > > > > >>>> 512B, or 4KB) IO, so introduce one flag to decide if this bio includes
> > > > > >>>> multiple page. And only consider to try splitting this bio in case
> > > > > >>>> that the multiple page flag is set.
> > > > > >>>
> > > > > >>> So, back in the day I had an alternative approach in mind: get rid of
> > > > > >>> blk_queue_split entirely, by pushing splitting down to the request layer - when
> > > > > >>> we map the bio/request to sgl, just have it map as much as will fit in the sgl
> > > > > >>> and if it doesn't entirely fit bump bi_remaining and leave it on the request
> > > > > >>> queue.
> > > > > >>>
> > > > > >>> This would mean there'd be no need for counting segments at all, and would cut a
> > > > > >>> fair amount of code out of the io path.
> > > > > >>
> > > > > >> I thought about that to, but it will take a lot more effort.  Mostly
> > > > > >> because md/dm heavily rely on splitting as well.  I still think it is
> > > > > >> worthwhile, it will just take a significant amount of time and we
> > > > > >> should have the quick improvement now.
> > > > > > 
> > > > > > We can do it one driver at a time - driver sets a flag to disable
> > > > > > blk_queue_split(). Obvious one to do first would be nvme since that's where it
> > > > > > shows up the most.
> > > > > > 
> > > > > > And md/md do splitting internally, but I'm not so sure they need
> > > > > > blk_queue_split().
> > > > > 
> > > > > I'm a big proponent of doing something like that instead, but it is a
> > > > > lot of work. I absolutely hate the splitting we're doing now, even
> > > > > though the original "let's work as hard as we add add page time to get
> > > > > things right" was pretty abysmal as well.
> > > > 
> > > > Last I looked I don't think it was going to be that bad, just needed a bit of
> > > > finesse. We just need to be able to partially process a request in e.g.
> > > > nvme_map_data(), and blk_rq_map_sg() needs to be modified to only map as much as
> > > > will fit instead of popping an assertion.
> > > 
> > > I think it may not be doable.
> > > 
> > > blk_rq_map_sg() is called by drivers and has to work on single request, however
> > > more requests have to be involved if we delay the splitting to blk_rq_map_sg().
> > > Cause splitting means that two bios can't be submitted in single IO request.
> > 
> > Of course it's doable, do I have to show you how?
> 
> No, you don't have to, could you just point out where my above words is wrong?

blk_rq_map_sg() _currently_ works on a single request, but as I said from the
start, this would involve changing it to only process as much of a request
as will fit in an sglist.

Drivers will have to be modified, but the changes to driver code should be
pretty easy. What will be slightly trickier is changing blk-mq to handle
requests that are only partially completed; that will be harder than it would
have been before blk-mq, since the old request queue code used to handle
partially completed requests - not much work would have had to be done on
that code.

I'm not very familiar with the blk-mq code, so Jens would be better qualified to
say how best to change that code. The basic idea would probably be the same as
how bios now have a refcount - bi_remaining - to track splits/completions. If
requests (in blk-mq land) don't have such a refcount (they don't appear to), it
will have to be added.

From a quick glance, blk_mq_complete_request() is where the refcount put will
have to be added. I haven't found where requests are popped off the request
queue in blk-mq land yet - the code will have to be changed to only do that once
the request has been fully mapped and submitted by the driver.
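
A condensed sketch of that bookkeeping with invented names (the only real API
used is blk_mq_end_request()): the count starts at 1 for the submission path,
each issued chunk takes a reference, and the last put ends the request.

static void rq_chunks_prepare(atomic_t *remaining)
{
	atomic_set(remaining, 1);		/* submission holds one ref */
}

static void rq_chunk_issued(atomic_t *remaining)
{
	atomic_inc(remaining);			/* one ref per in-flight chunk */
}

static void rq_chunk_put(struct request *rq, atomic_t *remaining,
			 blk_status_t status)
{
	if (atomic_dec_and_test(remaining))
		blk_mq_end_request(rq, status);	/* last reference ends the rq */
}

Both the chunk-completion path and the "done issuing" path would call
rq_chunk_put(), so a request is never ended while a later chunk is still
being mapped.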
Jens Axboe Nov. 5, 2019, 2:38 a.m. UTC | #14
On 11/4/19 7:30 PM, Kent Overstreet wrote:
> On Tue, Nov 05, 2019 at 10:20:46AM +0800, Ming Lei wrote:
>> On Mon, Nov 04, 2019 at 09:11:30PM -0500, Kent Overstreet wrote:
>>> On Tue, Nov 05, 2019 at 09:11:35AM +0800, Ming Lei wrote:
>>>> On Mon, Nov 04, 2019 at 01:42:17PM -0500, Kent Overstreet wrote:
>>>>> On Mon, Nov 04, 2019 at 11:23:42AM -0700, Jens Axboe wrote:
>>>>>> On 11/4/19 11:17 AM, Kent Overstreet wrote:
>>>>>>> On Mon, Nov 04, 2019 at 10:15:41AM -0800, Christoph Hellwig wrote:
>>>>>>>> On Mon, Nov 04, 2019 at 01:14:03PM -0500, Kent Overstreet wrote:
>>>>>>>>> On Sat, Nov 02, 2019 at 03:29:11PM +0800, Ming Lei wrote:
>>>>>>>>>> __blk_queue_split() may be a bit heavy for small block size(such as
>>>>>>>>>> 512B, or 4KB) IO, so introduce one flag to decide if this bio includes
>>>>>>>>>> multiple page. And only consider to try splitting this bio in case
>>>>>>>>>> that the multiple page flag is set.
>>>>>>>>>
>>>>>>>>> So, back in the day I had an alternative approach in mind: get rid of
>>>>>>>>> blk_queue_split entirely, by pushing splitting down to the request layer - when
>>>>>>>>> we map the bio/request to sgl, just have it map as much as will fit in the sgl
>>>>>>>>> and if it doesn't entirely fit bump bi_remaining and leave it on the request
>>>>>>>>> queue.
>>>>>>>>>
>>>>>>>>> This would mean there'd be no need for counting segments at all, and would cut a
>>>>>>>>> fair amount of code out of the io path.
>>>>>>>>
>>>>>>>> I thought about that to, but it will take a lot more effort.  Mostly
>>>>>>>> because md/dm heavily rely on splitting as well.  I still think it is
>>>>>>>> worthwhile, it will just take a significant amount of time and we
>>>>>>>> should have the quick improvement now.
>>>>>>>
>>>>>>> We can do it one driver at a time - driver sets a flag to disable
>>>>>>> blk_queue_split(). Obvious one to do first would be nvme since that's where it
>>>>>>> shows up the most.
>>>>>>>
>>>>>>> And md/md do splitting internally, but I'm not so sure they need
>>>>>>> blk_queue_split().
>>>>>>
>>>>>> I'm a big proponent of doing something like that instead, but it is a
>>>>>> lot of work. I absolutely hate the splitting we're doing now, even
>>>>>> though the original "let's work as hard as we add add page time to get
>>>>>> things right" was pretty abysmal as well.
>>>>>
>>>>> Last I looked I don't think it was going to be that bad, just needed a bit of
>>>>> finesse. We just need to be able to partially process a request in e.g.
>>>>> nvme_map_data(), and blk_rq_map_sg() needs to be modified to only map as much as
>>>>> will fit instead of popping an assertion.
>>>>
>>>> I think it may not be doable.
>>>>
>>>> blk_rq_map_sg() is called by drivers and has to work on single request, however
>>>> more requests have to be involved if we delay the splitting to blk_rq_map_sg().
>>>> Cause splitting means that two bios can't be submitted in single IO request.
>>>
>>> Of course it's doable, do I have to show you how?
>>
>> No, you don't have to, could you just point out where my above words is wrong?
> 
> blk_rq_map_sg() _currently_ works on a single request, but as I said
> from the start that this would involve changing it to only process as
> much of a request as would fit on an sglist.
> 
> Drivers will have to be modified, but the changes to driver code
> should be pretty easy. What will be slightly trickier will be changing
> blk-mq to handle requests that are only partially completed; that will
> be harder than it would have been before blk-mq, since the old request
> queue code used to handle partially completed requests - not much work
> would have to be done that code.
> 
> I'm not very familiar with the blk-mq code, so Jens would be better
> qualified to say how best to change that code. The basic idea would
> probably be the same as how bios how have a refcount - bi_remaining -
> to track splits/completions. If requests (in blk-mq land) don't have
> such a refcount (they don't appear to), it will have to be added.
> 
>  From a quick glance, blk_mq_complete_request() is where the refcount
>  put will have to be added. I haven't found where requests are popped
>  off the request queue in blk-mq land yet - the code will have to be
>  changed to only do that once the request has been fully mapped and
>  submitted by the driver.

This is where my knee-jerk reaction to the initial "partial completions" and
"should be trivial" starts to kick in. I don't think they are necessarily
hard, but they aren't free either. And you'd need to be paying that
atomic_dec cost for every IO. Maybe that's cheaper than the work we
currently have to do, maybe not... If it's a clear win, then it'd be an
interesting path to pursue. But we probably won't have that answer until
at least a hacky version is done as a proof of concept.

On the upside, it'd simplify things to just have the mapping in one
place, when the request is set up. Though until all drivers do that
(which I worry will be never), we'd be stuck with both. Maybe
that's a bit too pessimistic; it should be easier now since we just have
blk-mq.
Ming Lei Nov. 5, 2019, 2:46 a.m. UTC | #15
On Mon, Nov 04, 2019 at 09:30:02PM -0500, Kent Overstreet wrote:
> On Tue, Nov 05, 2019 at 10:20:46AM +0800, Ming Lei wrote:
> > On Mon, Nov 04, 2019 at 09:11:30PM -0500, Kent Overstreet wrote:
> > > On Tue, Nov 05, 2019 at 09:11:35AM +0800, Ming Lei wrote:
> > > > On Mon, Nov 04, 2019 at 01:42:17PM -0500, Kent Overstreet wrote:
> > > > > On Mon, Nov 04, 2019 at 11:23:42AM -0700, Jens Axboe wrote:
> > > > > > On 11/4/19 11:17 AM, Kent Overstreet wrote:
> > > > > > > On Mon, Nov 04, 2019 at 10:15:41AM -0800, Christoph Hellwig wrote:
> > > > > > >> On Mon, Nov 04, 2019 at 01:14:03PM -0500, Kent Overstreet wrote:
> > > > > > >>> On Sat, Nov 02, 2019 at 03:29:11PM +0800, Ming Lei wrote:
> > > > > > >>>> __blk_queue_split() may be a bit heavy for small block size(such as
> > > > > > >>>> 512B, or 4KB) IO, so introduce one flag to decide if this bio includes
> > > > > > >>>> multiple page. And only consider to try splitting this bio in case
> > > > > > >>>> that the multiple page flag is set.
> > > > > > >>>
> > > > > > >>> So, back in the day I had an alternative approach in mind: get rid of
> > > > > > >>> blk_queue_split entirely, by pushing splitting down to the request layer - when
> > > > > > >>> we map the bio/request to sgl, just have it map as much as will fit in the sgl
> > > > > > >>> and if it doesn't entirely fit bump bi_remaining and leave it on the request
> > > > > > >>> queue.
> > > > > > >>>
> > > > > > >>> This would mean there'd be no need for counting segments at all, and would cut a
> > > > > > >>> fair amount of code out of the io path.
> > > > > > >>
> > > > > > >> I thought about that to, but it will take a lot more effort.  Mostly
> > > > > > >> because md/dm heavily rely on splitting as well.  I still think it is
> > > > > > >> worthwhile, it will just take a significant amount of time and we
> > > > > > >> should have the quick improvement now.
> > > > > > > 
> > > > > > > We can do it one driver at a time - driver sets a flag to disable
> > > > > > > blk_queue_split(). Obvious one to do first would be nvme since that's where it
> > > > > > > shows up the most.
> > > > > > > 
> > > > > > > And md/md do splitting internally, but I'm not so sure they need
> > > > > > > blk_queue_split().
> > > > > > 
> > > > > > I'm a big proponent of doing something like that instead, but it is a
> > > > > > lot of work. I absolutely hate the splitting we're doing now, even
> > > > > > though the original "let's work as hard as we add add page time to get
> > > > > > things right" was pretty abysmal as well.
> > > > > 
> > > > > Last I looked I don't think it was going to be that bad, just needed a bit of
> > > > > finesse. We just need to be able to partially process a request in e.g.
> > > > > nvme_map_data(), and blk_rq_map_sg() needs to be modified to only map as much as
> > > > > will fit instead of popping an assertion.
> > > > 
> > > > I think it may not be doable.
> > > > 
> > > > blk_rq_map_sg() is called by drivers and has to work on single request, however
> > > > more requests have to be involved if we delay the splitting to blk_rq_map_sg().
> > > > Cause splitting means that two bios can't be submitted in single IO request.
> > > 
> > > Of course it's doable, do I have to show you how?
> > 
> > No, you don't have to, could you just point out where my above words is wrong?
> 
> blk_rq_map_sg() _currently_ works on a single request, but as I said from the
> start that this would involve changing it to only process as much of a request
> as would fit on an sglist.

> Drivers will have to be modified, but the changes to driver code should be
> pretty easy. What will be slightly trickier will be changing blk-mq to handle
> requests that are only partially completed; that will be harder than it would
> have been before blk-mq, since the old request queue code used to handle
> partially completed requests - not much work would have to be done that code.

It looks like you are suggesting partial request completion.

Then the biggest effect could be on performance: this change will cause the
whole FS bio to be handled part by part, serially, instead of submitting all
of the split bios (parts) concurrently.

So it sounds like you are suggesting fixing one performance issue by causing
a new one - is that really doable?


Thanks,
Ming
Jens Axboe Nov. 5, 2019, 2:49 a.m. UTC | #16
On 11/4/19 7:46 PM, Ming Lei wrote:
> On Mon, Nov 04, 2019 at 09:30:02PM -0500, Kent Overstreet wrote:
>> On Tue, Nov 05, 2019 at 10:20:46AM +0800, Ming Lei wrote:
>>> On Mon, Nov 04, 2019 at 09:11:30PM -0500, Kent Overstreet wrote:
>>>> On Tue, Nov 05, 2019 at 09:11:35AM +0800, Ming Lei wrote:
>>>>> On Mon, Nov 04, 2019 at 01:42:17PM -0500, Kent Overstreet wrote:
>>>>>> On Mon, Nov 04, 2019 at 11:23:42AM -0700, Jens Axboe wrote:
>>>>>>> On 11/4/19 11:17 AM, Kent Overstreet wrote:
>>>>>>>> On Mon, Nov 04, 2019 at 10:15:41AM -0800, Christoph Hellwig wrote:
>>>>>>>>> On Mon, Nov 04, 2019 at 01:14:03PM -0500, Kent Overstreet wrote:
>>>>>>>>>> On Sat, Nov 02, 2019 at 03:29:11PM +0800, Ming Lei wrote:
>>>>>>>>>>> __blk_queue_split() may be a bit heavy for small block size(such as
>>>>>>>>>>> 512B, or 4KB) IO, so introduce one flag to decide if this bio includes
>>>>>>>>>>> multiple page. And only consider to try splitting this bio in case
>>>>>>>>>>> that the multiple page flag is set.
>>>>>>>>>>
>>>>>>>>>> So, back in the day I had an alternative approach in mind: get rid of
>>>>>>>>>> blk_queue_split entirely, by pushing splitting down to the request layer - when
>>>>>>>>>> we map the bio/request to sgl, just have it map as much as will fit in the sgl
>>>>>>>>>> and if it doesn't entirely fit bump bi_remaining and leave it on the request
>>>>>>>>>> queue.
>>>>>>>>>>
>>>>>>>>>> This would mean there'd be no need for counting segments at all, and would cut a
>>>>>>>>>> fair amount of code out of the io path.
>>>>>>>>>
>>>>>>>>> I thought about that to, but it will take a lot more effort.  Mostly
>>>>>>>>> because md/dm heavily rely on splitting as well.  I still think it is
>>>>>>>>> worthwhile, it will just take a significant amount of time and we
>>>>>>>>> should have the quick improvement now.
>>>>>>>>
>>>>>>>> We can do it one driver at a time - driver sets a flag to disable
>>>>>>>> blk_queue_split(). Obvious one to do first would be nvme since that's where it
>>>>>>>> shows up the most.
>>>>>>>>
>>>>>>>> And md/md do splitting internally, but I'm not so sure they need
>>>>>>>> blk_queue_split().
>>>>>>>
>>>>>>> I'm a big proponent of doing something like that instead, but it is a
>>>>>>> lot of work. I absolutely hate the splitting we're doing now, even
>>>>>>> though the original "let's work as hard as we add add page time to get
>>>>>>> things right" was pretty abysmal as well.
>>>>>>
>>>>>> Last I looked I don't think it was going to be that bad, just needed a bit of
>>>>>> finesse. We just need to be able to partially process a request in e.g.
>>>>>> nvme_map_data(), and blk_rq_map_sg() needs to be modified to only map as much as
>>>>>> will fit instead of popping an assertion.
>>>>>
>>>>> I think it may not be doable.
>>>>>
>>>>> blk_rq_map_sg() is called by drivers and has to work on single request, however
>>>>> more requests have to be involved if we delay the splitting to blk_rq_map_sg().
>>>>> Cause splitting means that two bios can't be submitted in single IO request.
>>>>
>>>> Of course it's doable, do I have to show you how?
>>>
>>> No, you don't have to, could you just point out where my above words is wrong?
>>
>> blk_rq_map_sg() _currently_ works on a single request, but as I said from the
>> start that this would involve changing it to only process as much of a request
>> as would fit on an sglist.
> 
>> Drivers will have to be modified, but the changes to driver code should be
>> pretty easy. What will be slightly trickier will be changing blk-mq to handle
>> requests that are only partially completed; that will be harder than it would
>> have been before blk-mq, since the old request queue code used to handle
>> partially completed requests - not much work would have to be done that code.
> 
> Looks you are suggesting partial request completion.
> 
> Then the biggest effect could be in performance, this change will cause the
> whole FS bio is handled part by part serially, instead of submitting all
> splitted bios(part) concurrently.
> 
> So sounds you are suggesting to fix one performance issue by causing new perf
> issue, is that doable?

It does seem like a rat hole of sorts. Because then you start adding code to
guesstimate how big the request could roughly be, and if you miss a bit,
you get a request that's tiny in between the normal sized ones.

Or you'd clone, and then you could still have them inflight in parallel.
But then you're paying the cost of that...
Kent Overstreet Nov. 5, 2019, 3:14 a.m. UTC | #17
On Mon, Nov 04, 2019 at 07:38:42PM -0700, Jens Axboe wrote:
> This is where my knee jerk at the initial "partial completions" and
> "should be trivial" start to kick in. I don't think they are necessarily
> hard, but they aren't free either. And you'd need to be paying that
> atomic_dec cost for every IO.

No need - you added the code to avoid that atomic dec for bi_remaining in the
common case; the same approach will work here.

> currently have to do, maybe not... If it's a clear win, then it'd be an
> interesting path to pursue. But we probably won't have that answer until
> at least a hacky version is done as proof of concept.
> 
> On the upside, it'd simplify things to just have the mapping in one
> place, when the request is setup. Though until all drivers do that
> (which I worry will be never), then we'd be stuck with both. Maybe
> that's a bit to pessimistic, should be easier now since we just have
> blk-mq.

blk_rq_map_sg isn't called from _that_ many places; I suspect once it's figured
out for one driver the rest won't be that bad.

And even if some drivers remain unconverted, I personally _much_ prefer this
approach to more special-case fast paths, and I bet this approach will be faster
anyway.

Also - regarding drivers allocating the sglists, I think most high-performance
drivers preallocate a pool of sglists that are all sized to what the device is
capable of.
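
For reference, this is roughly how the existing bio code avoids the atomic in
the common case (condensed from bio_inc_remaining() and bio_remaining_done());
the same pattern could presumably be reused for request chunks:

static inline void sketch_inc_remaining(struct bio *bio)
{
	bio_set_flag(bio, BIO_CHAIN);		/* remember it was ever bumped */
	smp_mb__before_atomic();
	atomic_inc(&bio->__bi_remaining);
}

static inline bool sketch_remaining_done(struct bio *bio)
{
	if (!bio_flagged(bio, BIO_CHAIN))	/* common case: no atomic op */
		return true;

	if (atomic_dec_and_test(&bio->__bi_remaining)) {
		bio_clear_flag(bio, BIO_CHAIN);
		return true;
	}

	return false;
}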
Jens Axboe Nov. 5, 2019, 3:33 a.m. UTC | #18
On 11/4/19 8:14 PM, Kent Overstreet wrote:
> On Mon, Nov 04, 2019 at 07:38:42PM -0700, Jens Axboe wrote:
>> This is where my knee jerk at the initial "partial completions" and
>> "should be trivial" start to kick in. I don't think they are
>> necessarily hard, but they aren't free either. And you'd need to be
>> paying that atomic_dec cost for every IO.
> 
> No need - you added the code to avoid that atomic dec for bi_remaining
> in the common case, the same approach will work here.

I guess that would work for the common case of not splitting. If we split,
then it's OK to pay a higher cost. We would have anyway, with the
existing code.

>> currently have to do, maybe not... If it's a clear win, then it'd be
>> an interesting path to pursue. But we probably won't have that answer
>> until at least a hacky version is done as proof of concept.
>>
>> On the upside, it'd simplify things to just have the mapping in one
>> place, when the request is setup. Though until all drivers do that
>> (which I worry will be never), then we'd be stuck with both. Maybe
>> that's a bit to pessimistic, should be easier now since we just have
>> blk-mq.
> 
> blk_rq_map_sg isn't called from _that_ many places, I suspect once
> it's figured out for one driver the rest won't be that bad.

It's definitely easier than it would have been; most things are pretty
streamlined now with the blk-mq conversion. And the ones that don't call
blk_rq_map_sg() usually don't do DMA on the requests. Sizes tend to be
more arbitrary there, and not hard boundaries.

> And even if some drivers remain unconverted, I personally _much_
> prefer this approach to more special case fast paths, and I bet this
> approach will be faster anyways.
> 
> Also - regarding driver allocating of the sglists, I think most high
> performance drivers preallocate a pool of sglists that are all sized
> to what the device is capable of.

They do, but most of them probably also assume one sg list per request.
We'd have one request now instead of multiple, so we'd either serialize it
(which would definitely suck for some common cases), or do something
very funky with mapping to multiple requests at the same time.

But I don't think we should argue this much more. If someone wants to do
the work to make this work, even a prototype, it's much better to argue
over actual code than potential issues and wins now.
Ming Lei Nov. 5, 2019, 3:34 a.m. UTC | #19
On Mon, Nov 04, 2019 at 07:49:49PM -0700, Jens Axboe wrote:
> On 11/4/19 7:46 PM, Ming Lei wrote:
> > On Mon, Nov 04, 2019 at 09:30:02PM -0500, Kent Overstreet wrote:
> >> On Tue, Nov 05, 2019 at 10:20:46AM +0800, Ming Lei wrote:
> >>> On Mon, Nov 04, 2019 at 09:11:30PM -0500, Kent Overstreet wrote:
> >>>> On Tue, Nov 05, 2019 at 09:11:35AM +0800, Ming Lei wrote:
> >>>>> On Mon, Nov 04, 2019 at 01:42:17PM -0500, Kent Overstreet wrote:
> >>>>>> On Mon, Nov 04, 2019 at 11:23:42AM -0700, Jens Axboe wrote:
> >>>>>>> On 11/4/19 11:17 AM, Kent Overstreet wrote:
> >>>>>>>> On Mon, Nov 04, 2019 at 10:15:41AM -0800, Christoph Hellwig wrote:
> >>>>>>>>> On Mon, Nov 04, 2019 at 01:14:03PM -0500, Kent Overstreet wrote:
> >>>>>>>>>> On Sat, Nov 02, 2019 at 03:29:11PM +0800, Ming Lei wrote:
> >>>>>>>>>>> __blk_queue_split() may be a bit heavy for small block size(such as
> >>>>>>>>>>> 512B, or 4KB) IO, so introduce one flag to decide if this bio includes
> >>>>>>>>>>> multiple page. And only consider to try splitting this bio in case
> >>>>>>>>>>> that the multiple page flag is set.
> >>>>>>>>>>
> >>>>>>>>>> So, back in the day I had an alternative approach in mind: get rid of
> >>>>>>>>>> blk_queue_split entirely, by pushing splitting down to the request layer - when
> >>>>>>>>>> we map the bio/request to sgl, just have it map as much as will fit in the sgl
> >>>>>>>>>> and if it doesn't entirely fit bump bi_remaining and leave it on the request
> >>>>>>>>>> queue.
> >>>>>>>>>>
> >>>>>>>>>> This would mean there'd be no need for counting segments at all, and would cut a
> >>>>>>>>>> fair amount of code out of the io path.
> >>>>>>>>>
> >>>>>>>>> I thought about that to, but it will take a lot more effort.  Mostly
> >>>>>>>>> because md/dm heavily rely on splitting as well.  I still think it is
> >>>>>>>>> worthwhile, it will just take a significant amount of time and we
> >>>>>>>>> should have the quick improvement now.
> >>>>>>>>
> >>>>>>>> We can do it one driver at a time - driver sets a flag to disable
> >>>>>>>> blk_queue_split(). Obvious one to do first would be nvme since that's where it
> >>>>>>>> shows up the most.
> >>>>>>>>
> >>>>>>>> And md/md do splitting internally, but I'm not so sure they need
> >>>>>>>> blk_queue_split().
> >>>>>>>
> >>>>>>> I'm a big proponent of doing something like that instead, but it is a
> >>>>>>> lot of work. I absolutely hate the splitting we're doing now, even
> >>>>>>> though the original "let's work as hard as we add add page time to get
> >>>>>>> things right" was pretty abysmal as well.
> >>>>>>
> >>>>>> Last I looked I don't think it was going to be that bad, just needed a bit of
> >>>>>> finesse. We just need to be able to partially process a request in e.g.
> >>>>>> nvme_map_data(), and blk_rq_map_sg() needs to be modified to only map as much as
> >>>>>> will fit instead of popping an assertion.
> >>>>>
> >>>>> I think it may not be doable.
> >>>>>
> >>>>> blk_rq_map_sg() is called by drivers and has to work on single request, however
> >>>>> more requests have to be involved if we delay the splitting to blk_rq_map_sg().
> >>>>> Cause splitting means that two bios can't be submitted in single IO request.
> >>>>
> >>>> Of course it's doable, do I have to show you how?
> >>>
> >>> No, you don't have to, could you just point out where my above words is wrong?
> >>
> >> blk_rq_map_sg() _currently_ works on a single request, but as I said from the
> >> start that this would involve changing it to only process as much of a request
> >> as would fit on an sglist.
> > 
> >> Drivers will have to be modified, but the changes to driver code should be
> >> pretty easy. What will be slightly trickier will be changing blk-mq to handle
> >> requests that are only partially completed; that will be harder than it would
> >> have been before blk-mq, since the old request queue code used to handle
> >> partially completed requests - not much work would have to be done that code.
> > 
> > Looks you are suggesting partial request completion.
> > 
> > Then the biggest effect could be in performance, this change will cause the
> > whole FS bio is handled part by part serially, instead of submitting all
> > splitted bios(part) concurrently.
> > 
> > So sounds you are suggesting to fix one performance issue by causing new perf
> > issue, is that doable?
> 
> It does seem like a rat hole of sorts. Because then you start adding code to
> guesstimate how big the request could roughly be, and if you miss a bit,
> you get a request that's tiny in between the normal sized ones.
> 
> Or you'd clone, and then you could still have them inflight in parallel.
> But then you're paying the cost of that...

However, IO latency is often much longer than the latency of a bio fast
clone, which is just one fixed-size slab allocation and should mostly hit
the cache.

I understand the current issue is the cost of blk_queue_split() in the case
where no splitting is required. blk_bio_segment_split() is already simple
enough, so it looks like we need to investigate why it is so slow, since it
shouldn't be this inefficient.

Thanks,
Ming

Patch

diff --git a/block/bio.c b/block/bio.c
index 8f0ed6228fc5..eeb81679689b 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -583,6 +583,8 @@  void __bio_clone_fast(struct bio *bio, struct bio *bio_src)
 	bio_set_flag(bio, BIO_CLONED);
 	if (bio_flagged(bio_src, BIO_THROTTLED))
 		bio_set_flag(bio, BIO_THROTTLED);
+	if (bio_flagged(bio_src, BIO_MULTI_PAGE))
+		bio_set_flag(bio, BIO_MULTI_PAGE);
 	bio->bi_opf = bio_src->bi_opf;
 	bio->bi_ioprio = bio_src->bi_ioprio;
 	bio->bi_write_hint = bio_src->bi_write_hint;
@@ -757,6 +759,9 @@  bool __bio_try_merge_page(struct bio *bio, struct page *page,
 		if (page_is_mergeable(bv, page, len, off, same_page)) {
 			bv->bv_len += len;
 			bio->bi_iter.bi_size += len;
+
+			if (!*same_page)
+				bio_set_flag(bio, BIO_MULTI_PAGE);
 			return true;
 		}
 	}
@@ -789,6 +794,10 @@  void __bio_add_page(struct bio *bio, struct page *page,
 	bio->bi_iter.bi_size += len;
 	bio->bi_vcnt++;
 
+	if (!bio_flagged(bio, BIO_MULTI_PAGE) && (bio->bi_vcnt >= 2 ||
+				bv->bv_len > PAGE_SIZE))
+		bio_set_flag(bio, BIO_MULTI_PAGE);
+
 	if (!bio_flagged(bio, BIO_WORKINGSET) && unlikely(PageWorkingset(page)))
 		bio_set_flag(bio, BIO_WORKINGSET);
 }
diff --git a/block/blk-merge.c b/block/blk-merge.c
index 48e6725b32ee..b0670711dc54 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -309,7 +309,13 @@  void __blk_queue_split(struct request_queue *q, struct bio **bio,
 				nr_segs);
 		break;
 	default:
-		split = blk_bio_segment_split(q, *bio, &q->bio_split, nr_segs);
+		if (!bio_flagged(*bio, BIO_MULTI_PAGE)) {
+			*nr_segs = 1;
+			split = NULL;
+		} else {
+			split = blk_bio_segment_split(q, *bio, &q->bio_split,
+					nr_segs);
+		}
 		break;
 	}
 
diff --git a/block/bounce.c b/block/bounce.c
index f8ed677a1bf7..4b18a2accccc 100644
--- a/block/bounce.c
+++ b/block/bounce.c
@@ -253,6 +253,9 @@  static struct bio *bounce_clone_bio(struct bio *bio_src, gfp_t gfp_mask,
 	bio->bi_iter.bi_sector	= bio_src->bi_iter.bi_sector;
 	bio->bi_iter.bi_size	= bio_src->bi_iter.bi_size;
 
+	if (bio_flagged(bio_src, BIO_MULTI_PAGE))
+		bio_set_flag(bio, BIO_MULTI_PAGE);
+
 	switch (bio_op(bio)) {
 	case REQ_OP_DISCARD:
 	case REQ_OP_SECURE_ERASE:
diff --git a/drivers/md/bcache/util.c b/drivers/md/bcache/util.c
index 62fb917f7a4f..71f5cbb6fdd6 100644
--- a/drivers/md/bcache/util.c
+++ b/drivers/md/bcache/util.c
@@ -253,6 +253,8 @@  start:		bv->bv_len	= min_t(size_t, PAGE_SIZE - bv->bv_offset,
 
 		size -= bv->bv_len;
 	}
+
+	bio_set_flag(bio, BIO_MULTI_PAGE);
 }
 
 /**
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index d688b96d1d63..10b9a3539716 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -220,6 +220,9 @@  enum {
 				 * throttling rules. Don't do it again. */
 	BIO_TRACE_COMPLETION,	/* bio_endio() should trace the final completion
 				 * of this bio. */
+	BIO_MULTI_PAGE = BIO_USER_MAPPED,
+				/* used for optimize small BS IO from FS, so
+				 * share the bit flag with passthrough IO */
 	BIO_QUEUE_ENTERED,	/* can use blk_queue_enter_live() */
 	BIO_TRACKED,		/* set if bio goes through the rq_qos path */
 	BIO_FLAG_LAST