| Message ID | 20191102072911.24817-1-ming.lei@redhat.com (mailing list archive) |
|---|---|
| State | New, archived |
| Series | [V4] block: optimize for small block size IO |
On 11/2/19 1:29 AM, Ming Lei wrote:
> __blk_queue_split() may be a bit heavy for small block size (such as
> 512B or 4KB) IO, so introduce a flag that records whether this bio
> includes multiple pages, and only consider splitting the bio when
> that multi-page flag is set.
>
> A ~3% - 5% IOPS improvement can be observed in an io_uring test over
> null_blk (MQ); the io_uring test code is from fio/t/io_uring.c.
>
> bch_bio_map() should be the only caller that doesn't use
> bio_add_page(), so force-mark bios built via bch_bio_map() as
> MULTI_PAGE.
>
> RAID5 has similar usage, but its bio really is a single-page bio, so
> it isn't necessary to handle there.

Thanks Ming, applied.
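To make the mechanism concrete, here is a userspace toy model of the flag-based fast path the patch describes: a bio built page-by-page only gets a "multi-page" mark once a second page is added, and the submit path runs the (comparatively heavy) split pass only for marked bios. All names and structures below are illustrative stand-ins, not the kernel's real definitions.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for the multi-page bio flag proposed in the patch.
 * TOY_BIO_MULTI_PAGE and struct toy_bio are illustrative names only. */
enum { TOY_BIO_MULTI_PAGE = 1u << 0 };

struct toy_bio {
	unsigned int flags;
	unsigned int vcnt;	/* number of page vectors added so far */
};

/* Model of bio_add_page(): adding a second page marks the bio. */
static void toy_bio_add_page(struct toy_bio *bio)
{
	if (bio->vcnt > 0)
		bio->flags |= TOY_BIO_MULTI_PAGE;
	bio->vcnt++;
}

/* Model of the submit-path check: only multi-page bios take the
 * expensive split pass; small single-page IO skips it entirely. */
static bool toy_needs_split_pass(const struct toy_bio *bio)
{
	return bio->flags & TOY_BIO_MULTI_PAGE;
}
```

The point of the patch is exactly this cheap up-front check: a 512B or 4KB bio never pays for the full splitting logic.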
On 11/2/19 8:03 AM, Jens Axboe wrote:
> On 11/2/19 1:29 AM, Ming Lei wrote:
>> [...]
>
> Thanks Ming, applied.

Actually, I took a closer look at this. I thought the BIO_MAP_USER overload would be OK, but that seems potentially fragile, and so does the fact that we now need to maintain extra state for multi-page. Any serious objections to just doing the somewhat hacky bio->bi_vcnt check instead? With a comment I think that's more acceptable, and it doesn't rely on maintaining extra state. Particularly the latter is a big win, imho.
On Sat, Nov 2, 2019 at 11:58 PM Jens Axboe <axboe@kernel.dk> wrote:
> [...]
>
> Actually, I took a closer look at this. I thought the BIO_MAP_USER
> overload would be OK, but that seems potentially fragile, and so does
> the fact that we now need to maintain extra state for multi-page.
> Any serious objections to just doing the somewhat hacky bio->bi_vcnt
> check instead? With a comment I think that's more acceptable, and it
> doesn't rely on maintaining extra state. Particularly the latter is a
> big win, imho.

I am fine with checking bio->bi_vcnt with a comment.

Thanks,
Ming Lei
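The stateless check Jens suggests could look roughly like this toy userspace model: no extra flag is maintained anywhere; the split pass is skipped purely by reading the bio's vector count. As before, the `toy_*` names are illustrative, not the kernel's actual code.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of the bi_vcnt fast path: a bio built via bio_add_page()
 * that holds a single vector cannot need splitting, so skip the
 * splitter without tracking any extra per-bio state.
 * struct toy_bio is an illustrative stand-in for struct bio. */
struct toy_bio {
	unsigned short vcnt;	/* models bio->bi_vcnt */
};

static bool toy_maybe_split(const struct toy_bio *bio)
{
	/*
	 * Somewhat hacky, per the thread: rely on the vector count
	 * rather than a dedicated flag. Only multi-vector bios are
	 * worth running the split pass on.
	 */
	return bio->vcnt > 1;
}
```

The trade-off discussed in the thread is visible here: no state to keep in sync (unlike the flag approach), at the cost of overloading the meaning of the vector count, hence the insistence on a comment.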
On Sat, Nov 02, 2019 at 03:29:11PM +0800, Ming Lei wrote:
> __blk_queue_split() may be a bit heavy for small block size (such as
> 512B or 4KB) IO, so introduce a flag that records whether this bio
> includes multiple pages, and only consider splitting the bio when
> that multi-page flag is set.

So, back in the day I had an alternative approach in mind: get rid of blk_queue_split() entirely by pushing splitting down to the request layer - when we map the bio/request to an sgl, just have it map as much as will fit in the sgl, and if it doesn't entirely fit, bump bi_remaining and leave the rest on the request queue.

This would mean there'd be no need for counting segments at all, and it would cut a fair amount of code out of the IO path.
On Mon, Nov 04, 2019 at 01:14:03PM -0500, Kent Overstreet wrote:
> So, back in the day I had an alternative approach in mind: get rid of
> blk_queue_split() entirely by pushing splitting down to the request
> layer [...]

I thought about that too, but it will take a lot more effort, mostly because md/dm rely heavily on splitting as well. I still think it is worthwhile; it will just take a significant amount of time, and we should have the quick improvement now.
On Mon, Nov 04, 2019 at 10:15:41AM -0800, Christoph Hellwig wrote:
> [...]
>
> I thought about that too, but it will take a lot more effort, mostly
> because md/dm rely heavily on splitting as well. I still think it is
> worthwhile; it will just take a significant amount of time, and we
> should have the quick improvement now.

We can do it one driver at a time - the driver sets a flag to disable blk_queue_split(). The obvious one to do first would be nvme, since that's where it shows up the most.

And md/dm do splitting internally, but I'm not so sure they need blk_queue_split().
On 11/4/19 11:17 AM, Kent Overstreet wrote:
> [...]
>
> We can do it one driver at a time - the driver sets a flag to disable
> blk_queue_split(). The obvious one to do first would be nvme, since
> that's where it shows up the most.
>
> And md/dm do splitting internally, but I'm not so sure they need
> blk_queue_split().

I'm a big proponent of doing something like that instead, but it is a lot of work. I absolutely hate the splitting we're doing now, even though the original "let's work as hard as we can at add-page time to get things right" approach was pretty abysmal as well.
On Mon, Nov 04, 2019 at 11:23:42AM -0700, Jens Axboe wrote:
> [...]
>
> I'm a big proponent of doing something like that instead, but it is a
> lot of work. [...]

Last I looked, I don't think it was going to be that bad, just needed a bit of finesse. We just need to be able to partially process a request in e.g. nvme_map_data(), and blk_rq_map_sg() needs to be modified to only map as much as will fit instead of popping an assertion.
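The shape of the change Kent is describing can be sketched as a userspace toy model: instead of asserting when a request carries more segments than the sg table can hold, map as many as fit, report how many were consumed, and let the caller come back for the rest. Everything here (`toy_req`, `toy_rq_map_sg_partial`, etc.) is an illustrative stand-in for the real blk_rq_map_sg()/driver interaction, not the kernel API.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model of partial sg mapping: the request remembers how many of
 * its segments have already been mapped and submitted. */
struct toy_seg { size_t len; };

struct toy_req {
	struct toy_seg *segs;
	size_t nr_segs;
	size_t consumed;	/* segments already mapped/submitted */
};

/* Map up to sg_capacity of the still-unmapped segments and return how
 * many were mapped; a later call continues where this one stopped.
 * A real version would also fill the scatterlist entries. */
static size_t toy_rq_map_sg_partial(struct toy_req *rq, size_t sg_capacity)
{
	size_t remaining = rq->nr_segs - rq->consumed;
	size_t n = remaining < sg_capacity ? remaining : sg_capacity;

	rq->consumed += n;
	return n;
}
```

In this scheme the driver submits the mapped prefix, and the remainder of the request stays queued for the next round, which is exactly why the thread then turns to how blk-mq would track partial completion.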
On Mon, Nov 04, 2019 at 01:14:03PM -0500, Kent Overstreet wrote:
> So, back in the day I had an alternative approach in mind: get rid of
> blk_queue_split() entirely by pushing splitting down to the request
> layer - when we map the bio/request to an sgl, just have it map as
> much as will fit in the sgl, and if it doesn't entirely fit, bump
> bi_remaining and leave the rest on the request queue.

Many drivers don't need an sgl via blk_rq_map_sg(), but still need to split bios.

> This would mean there'd be no need for counting segments at all, and
> it would cut a fair amount of code out of the IO path.

No segment counting is involved in this small-block-size case, and the handling in blk_bio_segment_split() should be simple enough, so I still don't understand why the IOPS drop is observable.

Thanks,
Ming
On Mon, Nov 04, 2019 at 01:42:17PM -0500, Kent Overstreet wrote:
> [...]
>
> Last I looked, I don't think it was going to be that bad, just needed
> a bit of finesse. We just need to be able to partially process a
> request in e.g. nvme_map_data(), and blk_rq_map_sg() needs to be
> modified to only map as much as will fit instead of popping an
> assertion.

I think it may not be doable.

blk_rq_map_sg() is called by drivers and has to work on a single request; however, more requests would have to be involved if we delay the splitting to blk_rq_map_sg(), because splitting means that two bios can't be submitted in a single IO request.

Thanks,
Ming
On Tue, Nov 05, 2019 at 09:11:35AM +0800, Ming Lei wrote:
> [...]
>
> I think it may not be doable.
>
> blk_rq_map_sg() is called by drivers and has to work on a single
> request; however, more requests would have to be involved if we delay
> the splitting to blk_rq_map_sg(), because splitting means that two
> bios can't be submitted in a single IO request.

Of course it's doable, do I have to show you how?
On Mon, Nov 04, 2019 at 09:11:30PM -0500, Kent Overstreet wrote:
> [...]
>
> Of course it's doable, do I have to show you how?

No, you don't have to - could you just point out where my words above are wrong?

Thanks,
Ming
On Tue, Nov 05, 2019 at 10:20:46AM +0800, Ming Lei wrote:
> [...]
>
> No, you don't have to - could you just point out where my words above
> are wrong?

blk_rq_map_sg() _currently_ works on a single request, but as I said from the start, this would involve changing it to only process as much of a request as will fit in an sglist.

Drivers will have to be modified, but the changes to driver code should be pretty easy. What will be slightly trickier is changing blk-mq to handle requests that are only partially completed; that will be harder than it would have been before blk-mq, since the old request queue code used to handle partially completed requests - not much work would have had to be done on that code.

I'm not very familiar with the blk-mq code, so Jens would be better qualified to say how best to change it. The basic idea would probably be the same as how bios now have a refcount - bi_remaining - to track splits/completions. If requests (in blk-mq land) don't have such a refcount (they don't appear to), it will have to be added.

From a quick glance, blk_mq_complete_request() is where the refcount put will have to be added. I haven't found where requests are popped off the request queue in blk-mq land yet - the code will have to be changed to only do that once the request has been fully mapped and submitted by the driver.
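The bi_remaining-style refcount Kent describes can be modeled in a few lines of userspace C: each partially-submitted piece holds a reference on the request, and the request only truly completes when the last reference is dropped (the "put" that, per the thread, would live around blk_mq_complete_request()). This is a single-threaded illustrative sketch; in the kernel the counter would be an atomic_t, and the names here are made up.

```c
#include <assert.h>

/* Toy model of per-request completion refcounting for partial
 * submission: remaining counts in-flight parts; completed is set
 * exactly once, when the last part finishes. */
struct toy_req {
	int remaining;	/* stand-in for an atomic refcount */
	int completed;
};

/* Taken before submitting each additional partial mapping. */
static void toy_req_get(struct toy_req *rq)
{
	rq->remaining++;
}

/* Dropped as each part completes; the last put fires completion. */
static void toy_req_put(struct toy_req *rq)
{
	if (--rq->remaining == 0)
		rq->completed = 1;	/* real completion work goes here */
}
```

This is also exactly the cost Jens pushes back on in the next message: every IO pays for the reference counting whether or not it was split.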
On 11/4/19 7:30 PM, Kent Overstreet wrote:
> [...]
>
> From a quick glance, blk_mq_complete_request() is where the refcount
> put will have to be added. I haven't found where requests are popped
> off the request queue in blk-mq land yet - the code will have to be
> changed to only do that once the request has been fully mapped and
> submitted by the driver.

This is where my knee-jerk reaction to the initial "partial completions" and "should be trivial" starts to kick in. I don't think they are necessarily hard, but they aren't free either, and you'd need to be paying that atomic_dec cost for every IO. Maybe that's cheaper than the work we currently have to do, maybe not...

If it's a clear win, then it'd be an interesting path to pursue, but we probably won't have that answer until at least a hacky version is done as a proof of concept. On the upside, it'd simplify things to just have the mapping in one place, when the request is set up. Though until all drivers do that (which I worry will be never), we'd be stuck with both. Maybe that's a bit too pessimistic; it should be easier now since we just have blk-mq.
On Mon, Nov 04, 2019 at 09:30:02PM -0500, Kent Overstreet wrote: > On Tue, Nov 05, 2019 at 10:20:46AM +0800, Ming Lei wrote: > > On Mon, Nov 04, 2019 at 09:11:30PM -0500, Kent Overstreet wrote: > > > On Tue, Nov 05, 2019 at 09:11:35AM +0800, Ming Lei wrote: > > > > On Mon, Nov 04, 2019 at 01:42:17PM -0500, Kent Overstreet wrote: > > > > > On Mon, Nov 04, 2019 at 11:23:42AM -0700, Jens Axboe wrote: > > > > > > On 11/4/19 11:17 AM, Kent Overstreet wrote: > > > > > > > On Mon, Nov 04, 2019 at 10:15:41AM -0800, Christoph Hellwig wrote: > > > > > > >> On Mon, Nov 04, 2019 at 01:14:03PM -0500, Kent Overstreet wrote: > > > > > > >>> On Sat, Nov 02, 2019 at 03:29:11PM +0800, Ming Lei wrote: > > > > > > >>>> __blk_queue_split() may be a bit heavy for small block size(such as > > > > > > >>>> 512B, or 4KB) IO, so introduce one flag to decide if this bio includes > > > > > > >>>> multiple page. And only consider to try splitting this bio in case > > > > > > >>>> that the multiple page flag is set. > > > > > > >>> > > > > > > >>> So, back in the day I had an alternative approach in mind: get rid of > > > > > > >>> blk_queue_split entirely, by pushing splitting down to the request layer - when > > > > > > >>> we map the bio/request to sgl, just have it map as much as will fit in the sgl > > > > > > >>> and if it doesn't entirely fit bump bi_remaining and leave it on the request > > > > > > >>> queue. > > > > > > >>> > > > > > > >>> This would mean there'd be no need for counting segments at all, and would cut a > > > > > > >>> fair amount of code out of the io path. > > > > > > >> > > > > > > >> I thought about that to, but it will take a lot more effort. Mostly > > > > > > >> because md/dm heavily rely on splitting as well. I still think it is > > > > > > >> worthwhile, it will just take a significant amount of time and we > > > > > > >> should have the quick improvement now. 
> > > > > > > > > > > > > > We can do it one driver at a time - driver sets a flag to disable > > > > > > > blk_queue_split(). Obvious one to do first would be nvme since that's where it > > > > > > > shows up the most. > > > > > > > > > > > > > > And md/md do splitting internally, but I'm not so sure they need > > > > > > > blk_queue_split(). > > > > > > > > > > > > I'm a big proponent of doing something like that instead, but it is a > > > > > > lot of work. I absolutely hate the splitting we're doing now, even > > > > > > though the original "let's work as hard as we add add page time to get > > > > > > things right" was pretty abysmal as well. > > > > > > > > > > Last I looked I don't think it was going to be that bad, just needed a bit of > > > > > finesse. We just need to be able to partially process a request in e.g. > > > > > nvme_map_data(), and blk_rq_map_sg() needs to be modified to only map as much as > > > > > will fit instead of popping an assertion. > > > > > > > > I think it may not be doable. > > > > > > > > blk_rq_map_sg() is called by drivers and has to work on single request, however > > > > more requests have to be involved if we delay the splitting to blk_rq_map_sg(). > > > > Cause splitting means that two bios can't be submitted in single IO request. > > > > > > Of course it's doable, do I have to show you how? > > > > No, you don't have to, could you just point out where my above words is wrong? > > blk_rq_map_sg() _currently_ works on a single request, but as I said from the > start that this would involve changing it to only process as much of a request > as would fit on an sglist. > Drivers will have to be modified, but the changes to driver code should be > pretty easy. 
What will be slightly trickier will be changing blk-mq to handle
> requests that are only partially completed; that will be harder than it would
> have been before blk-mq, since the old request queue code used to handle
> partially completed requests - not much work would have to be done that code.

It looks like you are suggesting partial request completion.

The biggest effect could be on performance: this change makes the whole
FS bio be handled part by part, serially, instead of all of the split
bios (parts) being submitted concurrently.

So it sounds like you are suggesting to fix one performance issue by
introducing a new one - is that workable?

Thanks,
Ming
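To make the scheme under discussion concrete, it can be sketched outside the kernel. The following is an illustrative, self-contained toy model in plain C - the `toy_` names and types are invented, not the real `blk_rq_map_sg()` interface: a mapper consumes only as many segments as fit in a fixed-size sglist, and the caller loops, issuing one device command per partial mapping, rather than splitting the bio up front.

```c
#include <assert.h>

/* Toy stand-ins for struct request / the sglist; all names hypothetical. */
#define SGL_CAP 4

struct toy_request {
	int nr_segs;	/* total segments in the request */
	int next_seg;	/* first segment not yet mapped */
};

/*
 * Map as much of the request as fits in a SGL_CAP-entry sglist and
 * return the number of entries mapped; 0 means the request is done.
 * The change discussed in-thread would make blk_rq_map_sg() behave
 * roughly like this instead of asserting when the request overruns
 * the sglist.
 */
static int toy_map_sg(struct toy_request *rq)
{
	int remaining = rq->nr_segs - rq->next_seg;
	int mapped = remaining < SGL_CAP ? remaining : SGL_CAP;

	rq->next_seg += mapped;
	return mapped;
}

/* Drive one request to completion, counting device submissions. */
static int toy_submit_all(struct toy_request *rq)
{
	int submissions = 0;

	while (toy_map_sg(rq) > 0)
		submissions++;	/* one device command per partial mapping */
	return submissions;
}
```

With a 10-segment request and a 4-entry sglist, the loop issues three commands (4 + 4 + 2 segments); Ming's objection above is precisely that these three go out serially rather than concurrently.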
On 11/4/19 7:46 PM, Ming Lei wrote: > On Mon, Nov 04, 2019 at 09:30:02PM -0500, Kent Overstreet wrote: >> On Tue, Nov 05, 2019 at 10:20:46AM +0800, Ming Lei wrote: >>> On Mon, Nov 04, 2019 at 09:11:30PM -0500, Kent Overstreet wrote: >>>> On Tue, Nov 05, 2019 at 09:11:35AM +0800, Ming Lei wrote: >>>>> On Mon, Nov 04, 2019 at 01:42:17PM -0500, Kent Overstreet wrote: >>>>>> On Mon, Nov 04, 2019 at 11:23:42AM -0700, Jens Axboe wrote: >>>>>>> On 11/4/19 11:17 AM, Kent Overstreet wrote: >>>>>>>> On Mon, Nov 04, 2019 at 10:15:41AM -0800, Christoph Hellwig wrote: >>>>>>>>> On Mon, Nov 04, 2019 at 01:14:03PM -0500, Kent Overstreet wrote: >>>>>>>>>> On Sat, Nov 02, 2019 at 03:29:11PM +0800, Ming Lei wrote: >>>>>>>>>>> __blk_queue_split() may be a bit heavy for small block size(such as >>>>>>>>>>> 512B, or 4KB) IO, so introduce one flag to decide if this bio includes >>>>>>>>>>> multiple page. And only consider to try splitting this bio in case >>>>>>>>>>> that the multiple page flag is set. >>>>>>>>>> >>>>>>>>>> So, back in the day I had an alternative approach in mind: get rid of >>>>>>>>>> blk_queue_split entirely, by pushing splitting down to the request layer - when >>>>>>>>>> we map the bio/request to sgl, just have it map as much as will fit in the sgl >>>>>>>>>> and if it doesn't entirely fit bump bi_remaining and leave it on the request >>>>>>>>>> queue. >>>>>>>>>> >>>>>>>>>> This would mean there'd be no need for counting segments at all, and would cut a >>>>>>>>>> fair amount of code out of the io path. >>>>>>>>> >>>>>>>>> I thought about that to, but it will take a lot more effort. Mostly >>>>>>>>> because md/dm heavily rely on splitting as well. I still think it is >>>>>>>>> worthwhile, it will just take a significant amount of time and we >>>>>>>>> should have the quick improvement now. >>>>>>>> >>>>>>>> We can do it one driver at a time - driver sets a flag to disable >>>>>>>> blk_queue_split(). 
Obvious one to do first would be nvme since that's where it >>>>>>>> shows up the most. >>>>>>>> >>>>>>>> And md/md do splitting internally, but I'm not so sure they need >>>>>>>> blk_queue_split(). >>>>>>> >>>>>>> I'm a big proponent of doing something like that instead, but it is a >>>>>>> lot of work. I absolutely hate the splitting we're doing now, even >>>>>>> though the original "let's work as hard as we add add page time to get >>>>>>> things right" was pretty abysmal as well. >>>>>> >>>>>> Last I looked I don't think it was going to be that bad, just needed a bit of >>>>>> finesse. We just need to be able to partially process a request in e.g. >>>>>> nvme_map_data(), and blk_rq_map_sg() needs to be modified to only map as much as >>>>>> will fit instead of popping an assertion. >>>>> >>>>> I think it may not be doable. >>>>> >>>>> blk_rq_map_sg() is called by drivers and has to work on single request, however >>>>> more requests have to be involved if we delay the splitting to blk_rq_map_sg(). >>>>> Cause splitting means that two bios can't be submitted in single IO request. >>>> >>>> Of course it's doable, do I have to show you how? >>> >>> No, you don't have to, could you just point out where my above words is wrong? >> >> blk_rq_map_sg() _currently_ works on a single request, but as I said from the >> start that this would involve changing it to only process as much of a request >> as would fit on an sglist. > >> Drivers will have to be modified, but the changes to driver code should be >> pretty easy. What will be slightly trickier will be changing blk-mq to handle >> requests that are only partially completed; that will be harder than it would >> have been before blk-mq, since the old request queue code used to handle >> partially completed requests - not much work would have to be done that code. > > Looks you are suggesting partial request completion. 
>
> The biggest effect could be on performance: this change makes the
> whole FS bio be handled part by part, serially, instead of all of the
> split bios (parts) being submitted concurrently.
>
> So it sounds like you are suggesting to fix one performance issue by
> introducing a new one - is that workable?

It does seem like a rat hole of sorts. Because then you start adding code
to guesstimate how big the request could roughly be, and if you miss a bit,
you get a request that's tiny in between the normal sized ones.

Or you'd clone, and then you could still have them in flight in parallel.
But then you're paying the cost of that...
On Mon, Nov 04, 2019 at 07:38:42PM -0700, Jens Axboe wrote: > This is where my knee jerk at the initial "partial completions" and > "should be trivial" start to kick in. I don't think they are necessarily > hard, but they aren't free either. And you'd need to be paying that > atomic_dec cost for every IO. No need - you added the code to avoid that atomic dec for bi_remaining in the common case, the same approach will work here. > currently have to do, maybe not... If it's a clear win, then it'd be an > interesting path to pursue. But we probably won't have that answer until > at least a hacky version is done as proof of concept. > > On the upside, it'd simplify things to just have the mapping in one > place, when the request is setup. Though until all drivers do that > (which I worry will be never), then we'd be stuck with both. Maybe > that's a bit to pessimistic, should be easier now since we just have > blk-mq. blk_rq_map_sg isn't called from _that_ many places, I suspect once it's figured out for one driver the rest won't be that bad. And even if some drivers remain unconverted, I personally _much_ prefer this approach to more special case fast paths, and I bet this approach will be faster anyways. Also - regarding driver allocating of the sglists, I think most high performance drivers preallocate a pool of sglists that are all sized to what the device is capable of.
On 11/4/19 8:14 PM, Kent Overstreet wrote: > On Mon, Nov 04, 2019 at 07:38:42PM -0700, Jens Axboe wrote: >> This is where my knee jerk at the initial "partial completions" and >> "should be trivial" start to kick in. I don't think they are >> necessarily hard, but they aren't free either. And you'd need to be >> paying that atomic_dec cost for every IO. > > No need - you added the code to avoid that atomic dec for bi_remaining > in the common case, the same approach will work here. I guess that would work for the common case of not splitting. If we split, then it's OK to pay a higher cost. We would have anyway, with the existing code. >> currently have to do, maybe not... If it's a clear win, then it'd be >> an interesting path to pursue. But we probably won't have that answer >> until at least a hacky version is done as proof of concept. >> >> On the upside, it'd simplify things to just have the mapping in one >> place, when the request is setup. Though until all drivers do that >> (which I worry will be never), then we'd be stuck with both. Maybe >> that's a bit to pessimistic, should be easier now since we just have >> blk-mq. > > blk_rq_map_sg isn't called from _that_ many places, I suspect once > it's figured out for one driver the rest won't be that bad. It's definitely easier than it would have been, most things are pretty streamlined now with the blk-mq conversion. And the ones that don't call blk_rq_map_sg() usually don't do DMA on the requests. Sizes tend to be more arbitrary there, and not hard boundaries. > And even if some drivers remain unconverted, I personally _much_ > prefer this approach to more special case fast paths, and I bet this > approach will be faster anyways. > > Also - regarding driver allocating of the sglists, I think most high > performance drivers preallocate a pool of sglists that are all sized > to what the device is capable of. They do, but most of them probably also assume on sg list per request. 
We'd have one request now instead of multiple, so we'd either be serializing
it (which would definitely suck for some common cases), or doing something
very funky with mapping to multiple requests at the same time.

But I don't think we should argue this much more. If someone wants to do the
work to make this happen, even as a prototype, it's much better to argue over
actual code than over potential issues and wins now.
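The trick Kent refers to - paying the atomic op on the remaining-completions count only when a bio has actually been split or chained - can be modeled in a few lines. This is a toy sketch with invented names (in the kernel the pattern lives around bi_remaining and the chained-bio completion path), not the real implementation:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Toy bio: only the completion-counting parts; names are hypothetical. */
struct toy_bio {
	bool chained;		/* set only when the bio is split/chained */
	atomic_int remaining;	/* extra completions still outstanding */
};

/* Slow path opts in explicitly before handing out an extra completion. */
static void toy_bio_inc_remaining(struct toy_bio *bio)
{
	bio->chained = true;
	atomic_fetch_add(&bio->remaining, 1);
}

/*
 * Returns true when the bio is fully done.  The common, unsplit case
 * never touches the atomic counter at all - that is the cost being
 * discussed for the partial-completion scheme.
 */
static bool toy_bio_remaining_done(struct toy_bio *bio)
{
	if (!bio->chained)
		return true;	/* fast path: no atomic op */
	return atomic_fetch_sub(&bio->remaining, 1) == 1;
}
```

An unsplit bio completes on the first call without any atomic operation; only after one or more `toy_bio_inc_remaining()` calls does completion require counting down.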
On Mon, Nov 04, 2019 at 07:49:49PM -0700, Jens Axboe wrote: > On 11/4/19 7:46 PM, Ming Lei wrote: > > On Mon, Nov 04, 2019 at 09:30:02PM -0500, Kent Overstreet wrote: > >> On Tue, Nov 05, 2019 at 10:20:46AM +0800, Ming Lei wrote: > >>> On Mon, Nov 04, 2019 at 09:11:30PM -0500, Kent Overstreet wrote: > >>>> On Tue, Nov 05, 2019 at 09:11:35AM +0800, Ming Lei wrote: > >>>>> On Mon, Nov 04, 2019 at 01:42:17PM -0500, Kent Overstreet wrote: > >>>>>> On Mon, Nov 04, 2019 at 11:23:42AM -0700, Jens Axboe wrote: > >>>>>>> On 11/4/19 11:17 AM, Kent Overstreet wrote: > >>>>>>>> On Mon, Nov 04, 2019 at 10:15:41AM -0800, Christoph Hellwig wrote: > >>>>>>>>> On Mon, Nov 04, 2019 at 01:14:03PM -0500, Kent Overstreet wrote: > >>>>>>>>>> On Sat, Nov 02, 2019 at 03:29:11PM +0800, Ming Lei wrote: > >>>>>>>>>>> __blk_queue_split() may be a bit heavy for small block size(such as > >>>>>>>>>>> 512B, or 4KB) IO, so introduce one flag to decide if this bio includes > >>>>>>>>>>> multiple page. And only consider to try splitting this bio in case > >>>>>>>>>>> that the multiple page flag is set. > >>>>>>>>>> > >>>>>>>>>> So, back in the day I had an alternative approach in mind: get rid of > >>>>>>>>>> blk_queue_split entirely, by pushing splitting down to the request layer - when > >>>>>>>>>> we map the bio/request to sgl, just have it map as much as will fit in the sgl > >>>>>>>>>> and if it doesn't entirely fit bump bi_remaining and leave it on the request > >>>>>>>>>> queue. > >>>>>>>>>> > >>>>>>>>>> This would mean there'd be no need for counting segments at all, and would cut a > >>>>>>>>>> fair amount of code out of the io path. > >>>>>>>>> > >>>>>>>>> I thought about that to, but it will take a lot more effort. Mostly > >>>>>>>>> because md/dm heavily rely on splitting as well. I still think it is > >>>>>>>>> worthwhile, it will just take a significant amount of time and we > >>>>>>>>> should have the quick improvement now. 
> >>>>>>>> > >>>>>>>> We can do it one driver at a time - driver sets a flag to disable > >>>>>>>> blk_queue_split(). Obvious one to do first would be nvme since that's where it > >>>>>>>> shows up the most. > >>>>>>>> > >>>>>>>> And md/md do splitting internally, but I'm not so sure they need > >>>>>>>> blk_queue_split(). > >>>>>>> > >>>>>>> I'm a big proponent of doing something like that instead, but it is a > >>>>>>> lot of work. I absolutely hate the splitting we're doing now, even > >>>>>>> though the original "let's work as hard as we add add page time to get > >>>>>>> things right" was pretty abysmal as well. > >>>>>> > >>>>>> Last I looked I don't think it was going to be that bad, just needed a bit of > >>>>>> finesse. We just need to be able to partially process a request in e.g. > >>>>>> nvme_map_data(), and blk_rq_map_sg() needs to be modified to only map as much as > >>>>>> will fit instead of popping an assertion. > >>>>> > >>>>> I think it may not be doable. > >>>>> > >>>>> blk_rq_map_sg() is called by drivers and has to work on single request, however > >>>>> more requests have to be involved if we delay the splitting to blk_rq_map_sg(). > >>>>> Cause splitting means that two bios can't be submitted in single IO request. > >>>> > >>>> Of course it's doable, do I have to show you how? > >>> > >>> No, you don't have to, could you just point out where my above words is wrong? > >> > >> blk_rq_map_sg() _currently_ works on a single request, but as I said from the > >> start that this would involve changing it to only process as much of a request > >> as would fit on an sglist. > > > >> Drivers will have to be modified, but the changes to driver code should be > >> pretty easy. 
What will be slightly trickier will be changing blk-mq to handle > >> requests that are only partially completed; that will be harder than it would > >> have been before blk-mq, since the old request queue code used to handle > >> partially completed requests - not much work would have to be done that code. > > > > Looks you are suggesting partial request completion. > > > > Then the biggest effect could be in performance, this change will cause the > > whole FS bio is handled part by part serially, instead of submitting all > > splitted bios(part) concurrently. > > > > So sounds you are suggesting to fix one performance issue by causing new perf > > issue, is that doable? > > It does seem like a rat hole of sorts. Because then you start adding code to > guesstimate how big the request could roughly be, and if you miss a bit, > you get a request that's tiny in between the normal sized ones. > > Or you'd clone, and then you could still have them inflight in parallel. > But then you're paying the cost of that... However, IO latency is often much longer than latency of bio fast clone, which is just one fix-sized slab allocation, mostly should hit cache. I understand the current issue should be the cost of blk_queue_split() in case of no splitting is required. blk_bio_segment_split() has been simple enough, looks we need to investigate why it is so slow since it shouldn't be such in-efficient. Thanks, Ming
diff --git a/block/bio.c b/block/bio.c
index 8f0ed6228fc5..eeb81679689b 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -583,6 +583,8 @@ void __bio_clone_fast(struct bio *bio, struct bio *bio_src)
 	bio_set_flag(bio, BIO_CLONED);
 	if (bio_flagged(bio_src, BIO_THROTTLED))
 		bio_set_flag(bio, BIO_THROTTLED);
+	if (bio_flagged(bio_src, BIO_MULTI_PAGE))
+		bio_set_flag(bio, BIO_MULTI_PAGE);
 	bio->bi_opf = bio_src->bi_opf;
 	bio->bi_ioprio = bio_src->bi_ioprio;
 	bio->bi_write_hint = bio_src->bi_write_hint;
@@ -757,6 +759,9 @@ bool __bio_try_merge_page(struct bio *bio, struct page *page,
 		if (page_is_mergeable(bv, page, len, off, same_page)) {
 			bv->bv_len += len;
 			bio->bi_iter.bi_size += len;
+
+			if (!*same_page)
+				bio_set_flag(bio, BIO_MULTI_PAGE);
 			return true;
 		}
 	}
@@ -789,6 +794,10 @@ void __bio_add_page(struct bio *bio, struct page *page,
 	bio->bi_iter.bi_size += len;
 	bio->bi_vcnt++;
 
+	if (!bio_flagged(bio, BIO_MULTI_PAGE) && (bio->bi_vcnt >= 2 ||
+				bv->bv_len > PAGE_SIZE))
+		bio_set_flag(bio, BIO_MULTI_PAGE);
+
 	if (!bio_flagged(bio, BIO_WORKINGSET) && unlikely(PageWorkingset(page)))
 		bio_set_flag(bio, BIO_WORKINGSET);
 }
diff --git a/block/blk-merge.c b/block/blk-merge.c
index 48e6725b32ee..b0670711dc54 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -309,7 +309,13 @@ void __blk_queue_split(struct request_queue *q, struct bio **bio,
 				nr_segs);
 		break;
 	default:
-		split = blk_bio_segment_split(q, *bio, &q->bio_split, nr_segs);
+		if (!bio_flagged(*bio, BIO_MULTI_PAGE)) {
+			*nr_segs = 1;
+			split = NULL;
+		} else {
+			split = blk_bio_segment_split(q, *bio, &q->bio_split,
+					nr_segs);
+		}
 		break;
 	}
diff --git a/block/bounce.c b/block/bounce.c
index f8ed677a1bf7..4b18a2accccc 100644
--- a/block/bounce.c
+++ b/block/bounce.c
@@ -253,6 +253,9 @@ static struct bio *bounce_clone_bio(struct bio *bio_src, gfp_t gfp_mask,
 	bio->bi_iter.bi_sector = bio_src->bi_iter.bi_sector;
 	bio->bi_iter.bi_size = bio_src->bi_iter.bi_size;
 
+	if (bio_flagged(bio_src, BIO_MULTI_PAGE))
+		bio_set_flag(bio, BIO_MULTI_PAGE);
+
 	switch (bio_op(bio)) {
 	case REQ_OP_DISCARD:
 	case REQ_OP_SECURE_ERASE:
diff --git a/drivers/md/bcache/util.c b/drivers/md/bcache/util.c
index 62fb917f7a4f..71f5cbb6fdd6 100644
--- a/drivers/md/bcache/util.c
+++ b/drivers/md/bcache/util.c
@@ -253,6 +253,8 @@ start:	bv->bv_len = min_t(size_t, PAGE_SIZE - bv->bv_offset,
 		size -= bv->bv_len;
 	}
+
+	bio_set_flag(bio, BIO_MULTI_PAGE);
 }
 
 /**
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index d688b96d1d63..10b9a3539716 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -220,6 +220,9 @@ enum {
 				 * throttling rules. Don't do it again. */
 	BIO_TRACE_COMPLETION,	/* bio_endio() should trace the final completion
 				 * of this bio. */
+	BIO_MULTI_PAGE = BIO_USER_MAPPED,
+				/* used for optimize small BS IO from FS, so
+				 * share the bit flag with passthrough IO */
 	BIO_QUEUE_ENTERED,	/* can use blk_queue_enter_live() */
 	BIO_TRACKED,		/* set if bio goes through the rq_qos path */
 	BIO_FLAG_LAST
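The net effect of the patch's default branch in __blk_queue_split() can be sketched as a standalone decision function. This is illustrative only - toy types and invented names, not the kernel's structures:

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of the patched default: branch in __blk_queue_split(). */
struct toy_bio { bool multi_page; int nr_vecs; };

/*
 * Returns true when the (comparatively expensive) full segment walk of
 * blk_bio_segment_split() must run, and fills *nr_segs on the fast
 * path.  Single-page bios - the common 512B/4KB case the patch
 * targets - skip the walk entirely.
 */
static bool toy_needs_split_walk(const struct toy_bio *bio, int *nr_segs)
{
	if (!bio->multi_page) {
		*nr_segs = 1;	/* mirrors: *nr_segs = 1; split = NULL; */
		return false;
	}
	return true;		/* fall back to the full segment walk */
}
```

A single-page bio reports one segment and no split with a single flag test, which is where the claimed 3%-5% IOPS gain on small block size IO comes from.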