Message ID: 20200616005633.172804-1-harshadshirwadkar@gmail.com (mailing list archive)
State: New, archived
Series: block: add split_alignment for request queue
3 minor suggestions, and a larger question:

1) struct queue_limits already contains some alignment information (e.g.
discard_alignment, io_opt); perhaps this would better belong there instead
of in struct request_queue. (This would also move the documentation to
Documentation/ABI/testing/sysfs-block.)

2) Perhaps, in parallel with discard_alignment, this should be called
io_alignment.

3) The documentation has a consistent typo: 'requeust' -> 'request'.

Reading through this change, split_alignment appears similar in my mind to
limits->io_opt. But io_opt does not actually appear to be factored into bio
splitting, although it's described as "A properly aligned multiple of
optimal_io_size is the preferred request size for workloads where sustained
throughput is desired"... Probably I don't understand the intended use of
limits->io_opt, but it makes me wonder whether io_opt should be factored
into bio splitting instead of adding a new parameter. I am not very
familiar with this area, and I'd love to know more about what io_opt is for
and how this proposal differs.

Thanks!

John Dorminy
John,

> Reading through this change, split_alignment appears similar in my
> mind to limits->io_opt.

io_opt is a hint which requests that the submitter not issue I/Os larger
than this size. io_opt is smaller than or equal to the device's hard limit
for I/O size. In the standards space it's described as "issuing an I/O
larger than this value may incur a processing penalty". So it is not quite
what you're looking for.

However, we do have the chunk_sectors queue limit, which is honored for
splits.
On Mon, Jun 15, 2020 at 05:56:33PM -0700, Harshad Shirwadkar wrote:
> This feature allows the user to control the alignment at which request
> queue is allowed to split bios. Google CloudSQL's 16k user space
> application expects that direct io writes aligned at 16k boundary in
> the user-space are not split by kernel at non-16k boundaries. More
> details about this feature can be found in CloudSQL's Cloud Next 2018
> presentation[1]. The underlying block device is capable of performing
> 16k aligned writes atomically. Thus, this allows the user-space SQL
> application to avoid double-writes (to protect against partial
> failures) which are very costly provided that these writes are not
> split at non-16k boundary by any underlying layers.
>
> We make use of Ext4's bigalloc feature to ensure that writes issued by
> Ext4 are 16k aligned. But, 16K aligned data writes may get merged with
> contiguous non-16k aligned Ext4 metadata writes. Such a write request
> would be broken by the kernel only guaranteeing that the individually
> split requests are physical block size aligned.
>
> We started observing a significant increase in 16k unaligned splits in
> 5.4. Bisect points to commit 07173c3ec276cbb18dc0e0687d37d310e98a1480
> ("block: enable multipage bvecs"). This patch enables multipage bvecs
> resulting in multiple 16k aligned writes issued by the user-space to
> be merged into one big IO at first. Later, __blk_queue_split() splits
> these IOs while trying to align individual split IOs to be physical
> block size.
>
> Newly added split_alignment parameter is the alignment at which
> requeust queue is allowed to split IO request. By default this
> alignment is turned off and current behavior is unchanged.
>

Such alignment can be reached via q->limits.chunk_sectors, and you
just need to expose it via sysfs and make it writable.

Thanks,
Ming
Thanks for the feedback everyone, I totally overlooked chunk_sectors and
it sounds like that's exactly what we want. I'll send another patch
where that becomes writable.

On Mon, Jun 15, 2020 at 7:40 PM Ming Lei <ming.lei@redhat.com> wrote:
>
> On Mon, Jun 15, 2020 at 05:56:33PM -0700, Harshad Shirwadkar wrote:
> > This feature allows the user to control the alignment at which request
> > queue is allowed to split bios. Google CloudSQL's 16k user space
> > application expects that direct io writes aligned at 16k boundary in
> > the user-space are not split by kernel at non-16k boundaries. More
> > details about this feature can be found in CloudSQL's Cloud Next 2018
> > presentation[1]. The underlying block device is capable of performing
> > 16k aligned writes atomically. Thus, this allows the user-space SQL
> > application to avoid double-writes (to protect against partial
> > failures) which are very costly provided that these writes are not
> > split at non-16k boundary by any underlying layers.
> >
> > We make use of Ext4's bigalloc feature to ensure that writes issued by
> > Ext4 are 16k aligned. But, 16K aligned data writes may get merged with
> > contiguous non-16k aligned Ext4 metadata writes. Such a write request
> > would be broken by the kernel only guaranteeing that the individually
> > split requests are physical block size aligned.
> >
> > We started observing a significant increase in 16k unaligned splits in
> > 5.4. Bisect points to commit 07173c3ec276cbb18dc0e0687d37d310e98a1480
> > ("block: enable multipage bvecs"). This patch enables multipage bvecs
> > resulting in multiple 16k aligned writes issued by the user-space to
> > be merged into one big IO at first. Later, __blk_queue_split() splits
> > these IOs while trying to align individual split IOs to be physical
> > block size.
> >
> > Newly added split_alignment parameter is the alignment at which
> > requeust queue is allowed to split IO request. By default this
> > alignment is turned off and current behavior is unchanged.
> >
>
> Such alignment can be reached via q->limits.chunk_sectors, and you
> just need to expose it via sysfs and make it writable.
>
> Thanks,
> Ming
>
After taking a closer look, I don't see chunk_sectors being accounted
for in the splitting code. If I understand correctly, the implementation
of chunk_sectors is such that it allows for one big IO to go through.
In other words, it tries to avoid the splitting code if possible. But,
it doesn't seem to guarantee that when an IO is split, it will be split
at the configured alignment.

With commit 07173c3ec276cbb18dc0e0687d37d310e98a1480 ("block: enable
multipage bvecs"), we now see bigger multi-page IOs. So, if an app sent
out multiple writes, those all would get combined into one big IO.
chunk_sectors would allow that one big IO to go through *ONLY IF* the
total size is < max_sectors. If that's not the case, chunk_sectors
doesn't seem to have control over how this big IO will be split. Please
correct me if I'm wrong.

On Mon, Jun 15, 2020 at 7:50 PM harshad shirwadkar
<harshadshirwadkar@gmail.com> wrote:
>
> Thanks for the feedback everyone, I totally overlooked chunk_sectors and
> it sounds like that's exactly what we want. I'll send another patch
> where that becomes writable.
>
>
> On Mon, Jun 15, 2020 at 7:40 PM Ming Lei <ming.lei@redhat.com> wrote:
> >
> > On Mon, Jun 15, 2020 at 05:56:33PM -0700, Harshad Shirwadkar wrote:
> > > This feature allows the user to control the alignment at which request
> > > queue is allowed to split bios. Google CloudSQL's 16k user space
> > > application expects that direct io writes aligned at 16k boundary in
> > > the user-space are not split by kernel at non-16k boundaries. More
> > > details about this feature can be found in CloudSQL's Cloud Next 2018
> > > presentation[1]. The underlying block device is capable of performing
> > > 16k aligned writes atomically. Thus, this allows the user-space SQL
> > > application to avoid double-writes (to protect against partial
> > > failures) which are very costly provided that these writes are not
> > > split at non-16k boundary by any underlying layers.
> > >
> > > We make use of Ext4's bigalloc feature to ensure that writes issued by
> > > Ext4 are 16k aligned. But, 16K aligned data writes may get merged with
> > > contiguous non-16k aligned Ext4 metadata writes. Such a write request
> > > would be broken by the kernel only guaranteeing that the individually
> > > split requests are physical block size aligned.
> > >
> > > We started observing a significant increase in 16k unaligned splits in
> > > 5.4. Bisect points to commit 07173c3ec276cbb18dc0e0687d37d310e98a1480
> > > ("block: enable multipage bvecs"). This patch enables multipage bvecs
> > > resulting in multiple 16k aligned writes issued by the user-space to
> > > be merged into one big IO at first. Later, __blk_queue_split() splits
> > > these IOs while trying to align individual split IOs to be physical
> > > block size.
> > >
> > > Newly added split_alignment parameter is the alignment at which
> > > requeust queue is allowed to split IO request. By default this
> > > alignment is turned off and current behavior is unchanged.
> > >
> >
> > Such alignment can be reached via q->limits.chunk_sectors, and you
> > just need to expose it via sysfs and make it writable.
> >
> > Thanks,
> > Ming
> >
On Mon, Jun 15, 2020 at 08:24:08PM -0700, harshad shirwadkar wrote:
> After taking a closer look, I don't see chunk_sectors being accounted
> for in the splitting code. If I understand correctly, the implementation
> of chunk_sectors is such that it allows for one big IO to go through.
> In other words, it tries to avoid the splitting code if possible. But,
> it doesn't seem to guarantee that when an IO is split, it will be

Please take a look at blk_max_size_offset(), which is called by
get_max_io_size() from blk_bio_segment_split(); it splits the bio into
chunk_sectors-aligned bios.

You may also run an IO trace on a queue with chunk_sectors set up, and
check whether the split IO requests are what you expected.

Thanks,
Ming
On Mon, Jun 15, 2020 at 05:56:33PM -0700, Harshad Shirwadkar wrote:
> This feature allows the user to control the alignment at which request
> queue is allowed to split bios. Google CloudSQL's 16k user space
> application expects that direct io writes aligned at 16k boundary in
> the user-space are not split by kernel at non-16k boundaries.

That is a completely bogus assumption, and people who think they can
make that assumption should not be allowed to write software.

Can we stop this crap now please?
Hi Christoph;

Forgive me my newness and not understanding most subtleties, but I
don't actually understand why you say "that is a completely bogus
assumption" -- could you please elaborate or point me at a
document/code/list archive explaining?

Thanks in advance!

John

On Tue, Jun 16, 2020 at 3:30 AM Christoph Hellwig <hch@infradead.org> wrote:
>
> On Mon, Jun 15, 2020 at 05:56:33PM -0700, Harshad Shirwadkar wrote:
> > This feature allows the user to control the alignment at which request
> > queue is allowed to split bios. Google CloudSQL's 16k user space
> > application expects that direct io writes aligned at 16k boundary in
> > the user-space are not split by kernel at non-16k boundaries.
>
> That is a completely bogus assumptions and people who think they can
> make that assumption should not be allowed to write software.
>
> Can we stop this crap now please?
>
On Tue, Jun 16, 2020 at 07:12:48AM -0400, Sweet Tea Dorminy wrote:
> Hi Christoph;
>
> Forgive me my newness and not understanding most subtleties, but I
> don't actually understand why you say "that is a completely bogus
> assumption" -- could you please elaborate or point me at a
> document/code/list archive explaining?

There is absolutely no guarantee of the way in which the Linux kernel
splits and merges I/O requests. Even more importantly, there are
absolutely no guarantees about on-disk atomicity in the face of I/O
failures.
diff --git a/Documentation/block/queue-sysfs.rst b/Documentation/block/queue-sysfs.rst
index 6a8513af9201..c3eaba149415 100644
--- a/Documentation/block/queue-sysfs.rst
+++ b/Documentation/block/queue-sysfs.rst
@@ -251,4 +251,12 @@ devices are described in the ZBC (Zoned Block Commands) and ZAC
 do not support zone commands, they will be treated as regular block devices
 and zoned will report "none".
 
+split_alignment (RW)
+----------------------
+This is the alignment in bytes at which the requeust queue is allowed
+to split IO requests. Once this value is set, the requeust queue
+splits IOs such that the individual IOs are aligned to
+split_alignment. The value of 0 indicates that an IO request can be
+split anywhere. This value must be a power of 2.
+
 Jens Axboe <jens.axboe@oracle.com>, February 2009
diff --git a/block/blk-merge.c b/block/blk-merge.c
index 1534ed736363..cdf337c74b83 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -105,15 +105,18 @@ static struct bio *blk_bio_discard_split(struct request_queue *q,
 static struct bio *blk_bio_write_zeroes_split(struct request_queue *q,
 		struct bio *bio, struct bio_set *bs, unsigned *nsegs)
 {
+	sector_t split;
+
 	*nsegs = 0;
 
-	if (!q->limits.max_write_zeroes_sectors)
-		return NULL;
+	split = q->limits.max_write_zeroes_sectors;
+	if (split && q->split_alignment >> 9)
+		split = round_down(split, q->split_alignment >> 9);
 
-	if (bio_sectors(bio) <= q->limits.max_write_zeroes_sectors)
+	if (!split || bio_sectors(bio) <= split)
 		return NULL;
 
-	return bio_split(bio, q->limits.max_write_zeroes_sectors, GFP_NOIO, bs);
+	return bio_split(bio, split, GFP_NOIO, bs);
 }
 
 static struct bio *blk_bio_write_same_split(struct request_queue *q,
@@ -121,15 +124,18 @@ static struct bio *blk_bio_write_same_split(struct request_queue *q,
 		struct bio *bio, struct bio_set *bs, unsigned *nsegs)
 {
+	sector_t split;
+
 	*nsegs = 1;
 
-	if (!q->limits.max_write_same_sectors)
-		return NULL;
+	split = q->limits.max_write_same_sectors;
+	if (split && q->split_alignment >> 9)
+		split = round_down(split, q->split_alignment >> 9);
 
-	if (bio_sectors(bio) <= q->limits.max_write_same_sectors)
+	if (!split || bio_sectors(bio) <= split)
 		return NULL;
 
-	return bio_split(bio, q->limits.max_write_same_sectors, GFP_NOIO, bs);
+	return bio_split(bio, split, GFP_NOIO, bs);
 }
 
 /*
@@ -248,7 +254,10 @@ static struct bio *blk_bio_segment_split(struct request_queue *q,
 {
 	struct bio_vec bv, bvprv, *bvprvp = NULL;
 	struct bvec_iter iter;
-	unsigned nsegs = 0, sectors = 0;
+	unsigned int nsegs = 0, nsegs_aligned = 0;
+	unsigned int sectors = 0, sectors_aligned = 0, before = 0, after = 0;
+	unsigned int sector_alignment =
+		q->split_alignment ? (q->split_alignment >> 9) : 0;
 	const unsigned max_sectors = get_max_io_size(q, bio);
 	const unsigned max_segs = queue_max_segments(q);
 
@@ -264,12 +273,31 @@ static struct bio *blk_bio_segment_split(struct request_queue *q,
 		    sectors + (bv.bv_len >> 9) <= max_sectors &&
 		    bv.bv_offset + bv.bv_len <= PAGE_SIZE) {
 			nsegs++;
-			sectors += bv.bv_len >> 9;
-		} else if (bvec_split_segs(q, &bv, &nsegs, &sectors, max_segs,
-				max_sectors)) {
-			goto split;
+			before = round_down(sectors, sector_alignment);
+			sectors += (bv.bv_len >> 9);
+			after = round_down(sectors, sector_alignment);
+			if (sector_alignment && before != after) {
+				/* This is a valid split point */
+				nsegs_aligned = nsegs;
+				sectors_aligned = after;
+			}
+			goto next;
 		}
-
+		if (sector_alignment) {
+			before = round_down(sectors, sector_alignment);
+			after = round_down(sectors + (bv.bv_len >> 9),
+					   sector_alignment);
+			if ((nsegs < max_segs) && before != after &&
+			    ((after - before) << 9) + bv.bv_offset <= PAGE_SIZE
+			    && after <= max_sectors) {
+				sectors_aligned = after;
+				nsegs_aligned = nsegs + 1;
+			}
+		}
+		if (bvec_split_segs(q, &bv, &nsegs, &sectors, max_segs,
+				    max_sectors))
+			goto split;
+next:
 		bvprv = bv;
 		bvprvp = &bvprv;
 	}
@@ -278,7 +306,13 @@ static struct bio *blk_bio_segment_split(struct request_queue *q,
 	return NULL;
split:
 	*segs = nsegs;
-	return bio_split(bio, sectors, GFP_NOIO, bs);
+	if (sector_alignment && sectors_aligned == 0)
+		return NULL;
+
+	*segs = sector_alignment ? nsegs_aligned : nsegs;
+
+	return bio_split(bio, sector_alignment ? sectors_aligned : sectors,
+			 GFP_NOIO, bs);
 }
 
 /**
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index fca9b158f4a0..f045c7a79a74 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -529,6 +529,29 @@ static ssize_t queue_dax_show(struct request_queue *q, char *page)
 	return queue_var_show(blk_queue_dax(q), page);
 }
 
+static ssize_t queue_split_alignment_show(struct request_queue *q, char *page)
+{
+	return queue_var_show(q->split_alignment, page);
+}
+
+static ssize_t queue_split_alignment_store(struct request_queue *q, const char *page,
+					   size_t count)
+{
+	unsigned long split_alignment;
+	int ret;
+
+	ret = queue_var_store(&split_alignment, page, count);
+	if (ret < 0)
+		return ret;
+
+	/* split_alignment can only be a power of 2 */
+	if (split_alignment & (split_alignment - 1))
+		return -EINVAL;
+
+	q->split_alignment = split_alignment;
+	return count;
+}
+
 static struct queue_sysfs_entry queue_requests_entry = {
 	.attr = {.name = "nr_requests", .mode = 0644 },
 	.show = queue_requests_show,
@@ -727,6 +750,12 @@ static struct queue_sysfs_entry throtl_sample_time_entry = {
 };
 #endif
 
+static struct queue_sysfs_entry queue_split_alignment = {
+	.attr = {.name = "split_alignment", .mode = 0644 },
+	.show = queue_split_alignment_show,
+	.store = queue_split_alignment_store,
+};
+
 static struct attribute *queue_attrs[] = {
 	&queue_requests_entry.attr,
 	&queue_ra_entry.attr,
@@ -766,6 +795,7 @@ static struct attribute *queue_attrs[] = {
 #ifdef CONFIG_BLK_DEV_THROTTLING_LOW
 	&throtl_sample_time_entry.attr,
 #endif
+	&queue_split_alignment.attr,
 	NULL,
 };
 
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 32868fbedc9e..e8feb43f6fdd 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -473,6 +473,7 @@ struct request_queue {
 	void			*dma_drain_buffer;
 	unsigned int		dma_pad_mask;
 	unsigned int		dma_alignment;
+	unsigned int		split_alignment;
 
 	unsigned int		rq_timeout;
 	int			poll_nsec;
This feature allows the user to control the alignment at which request
queue is allowed to split bios. Google CloudSQL's 16k user space
application expects that direct io writes aligned at 16k boundary in
the user-space are not split by kernel at non-16k boundaries. More
details about this feature can be found in CloudSQL's Cloud Next 2018
presentation[1]. The underlying block device is capable of performing
16k aligned writes atomically. Thus, this allows the user-space SQL
application to avoid double-writes (to protect against partial
failures) which are very costly provided that these writes are not
split at non-16k boundary by any underlying layers.

We make use of Ext4's bigalloc feature to ensure that writes issued by
Ext4 are 16k aligned. But, 16K aligned data writes may get merged with
contiguous non-16k aligned Ext4 metadata writes. Such a write request
would be broken by the kernel only guaranteeing that the individually
split requests are physical block size aligned.

We started observing a significant increase in 16k unaligned splits in
5.4. Bisect points to commit 07173c3ec276cbb18dc0e0687d37d310e98a1480
("block: enable multipage bvecs"). This patch enables multipage bvecs
resulting in multiple 16k aligned writes issued by the user-space to
be merged into one big IO at first. Later, __blk_queue_split() splits
these IOs while trying to align individual split IOs to be physical
block size.

Newly added split_alignment parameter is the alignment at which
requeust queue is allowed to split IO request. By default this
alignment is turned off and current behavior is unchanged.
[1] CloudNext'18 "Optimizing performance for Cloud SQL for MySQL"
https://www.youtube.com/watch?v=gIeuiGg-_iw

Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
---
 Documentation/block/queue-sysfs.rst |  8 ++++
 block/blk-merge.c                   | 64 ++++++++++++++++++++++-------
 block/blk-sysfs.c                   | 30 ++++++++++++++
 include/linux/blkdev.h              |  1 +
 4 files changed, 88 insertions(+), 15 deletions(-)