block: add split_alignment for request queue

Message ID 20200616005633.172804-1-harshadshirwadkar@gmail.com (mailing list archive)
State New, archived
Series: block: add split_alignment for request queue

Commit Message

harshad shirwadkar June 16, 2020, 12:56 a.m. UTC
This feature allows the user to control the alignment at which the
request queue is allowed to split bios. Google CloudSQL's 16k
user-space application expects that direct IO writes aligned at a 16k
boundary in user-space are not split by the kernel at non-16k
boundaries. More details about this feature can be found in CloudSQL's
Cloud Next 2018 presentation [1]. The underlying block device is
capable of performing 16k aligned writes atomically. This allows the
user-space SQL application to avoid double-writes (used to protect
against partial failures), which are very costly, provided that these
writes are not split at a non-16k boundary by any underlying layer.

We make use of Ext4's bigalloc feature to ensure that writes issued by
Ext4 are 16k aligned. But 16k aligned data writes may get merged with
contiguous, non-16k aligned Ext4 metadata writes. Such a merged write
request is then split by the kernel with only the guarantee that the
individual split requests are aligned to the physical block size.

We started observing a significant increase in 16k unaligned splits in
5.4. A bisect points to commit 07173c3ec276cbb18dc0e0687d37d310e98a1480
("block: enable multipage bvecs"). That commit enables multipage bvecs,
so multiple 16k aligned writes issued by user-space are first merged
into one big IO. Later, __blk_queue_split() splits these IOs while
aligning the individual split IOs only to the physical block size.

The newly added split_alignment parameter is the alignment at which the
request queue is allowed to split an IO request. By default, this
alignment is turned off and the current behavior is unchanged.

[1] CloudNext'18 "Optimizing performance for Cloud SQL for MySQL"
    https://www.youtube.com/watch?v=gIeuiGg-_iw
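
As an illustrative usage sketch only (not part of the patch): with
split_alignment in place, a 16k alignment could be set from user space
roughly as below. The device name sda is an assumption for the example;
the sysfs path follows the queue-sysfs documentation added below.

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
            /* 16384 bytes = 16k; the store hook rejects values that are
             * not a power of 2. */
            const char val[] = "16384";
            int fd = open("/sys/block/sda/queue/split_alignment", O_WRONLY);

            if (fd < 0) {
                    perror("open");
                    return 1;
            }
            if (write(fd, val, sizeof(val) - 1) != (ssize_t)(sizeof(val) - 1))
                    perror("write");
            close(fd);
            return 0;
    }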

Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
---
 Documentation/block/queue-sysfs.rst |  8 ++++
 block/blk-merge.c                   | 64 ++++++++++++++++++++++-------
 block/blk-sysfs.c                   | 30 ++++++++++++++
 include/linux/blkdev.h              |  1 +
 4 files changed, 88 insertions(+), 15 deletions(-)

Comments

John Dorminy June 16, 2020, 2:14 a.m. UTC | #1
3 minor suggestions, and a larger question:

1) struct queue_limits already contains some alignment information;
e.g. discard_alignment, io_opt; perhaps this would better belong there
instead of in struct request_queue. (This would also move the
Documentation to Documentation/ABI/testing/sysfs-block)
2) Perhaps in parallel to discard_alignment this should be called io_alignment.
3) Documentation has consistent typo 'requeust'->request

Reading through this change, split_alignment appears similar in my
mind to limits->io_opt. But io_opt does not actually appear to be factored
into bio splitting, although it's described as "A properly aligned
multiple of optimal_io_size is the preferred request size for
workloads where sustained throughput is desired"... probably I don't
understand the intended use of limits->io_opt, but it makes me wonder
if perhaps io_opt should be factored into bio splitting instead of
adding a new parameter? I am not very familiar with this area, and I'd
love to know more about what io_opt is for and how this proposal is
different.

Thanks!

John Dorminy
Martin K. Petersen June 16, 2020, 2:25 a.m. UTC | #2
John,

> Reading through this change, split_alignment appears similar in my
> mind to limits->io_opt.

io_opt is a hint which requests that the submitter not issue any
I/Os that are larger than this size. io_opt is smaller than or equal to
the device's hard limit for I/O size. In the standards space it's
described as "issuing an I/O larger than this value may incur a
processing penalty". So not quite what you're looking for.

However, we have the chunk_sectors queue limit, which is honored for splits.
Ming Lei June 16, 2020, 2:40 a.m. UTC | #3
On Mon, Jun 15, 2020 at 05:56:33PM -0700, Harshad Shirwadkar wrote:
> [...]
>
> The newly added split_alignment parameter is the alignment at which the
> request queue is allowed to split an IO request. By default, this
> alignment is turned off and the current behavior is unchanged.
>

Such alignment can be reached via q->limits.chunk_sectors, and you
just need to expose it via sysfs and make it writable.

Thanks,
Ming
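
A minimal sketch of the writable chunk_sectors attribute Ming suggests,
modeled on the split_alignment store hook in this very patch. This is an
illustrative assumption, not code from the thread; the power-of-2 check
mirrors the constraint blk_queue_chunk_sectors enforced at the time:

    static ssize_t queue_chunk_sectors_store(struct request_queue *q,
                                             const char *page, size_t count)
    {
            unsigned long chunk_sectors;
            int ret = queue_var_store(&chunk_sectors, page, count);

            if (ret < 0)
                    return ret;

            /* assumed constraint: chunk_sectors must be 0 or a power of 2 */
            if (chunk_sectors & (chunk_sectors - 1))
                    return -EINVAL;

            q->limits.chunk_sectors = chunk_sectors;
            return count;
    }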
harshad shirwadkar June 16, 2020, 2:50 a.m. UTC | #4
Thanks for the feedback, everyone. I totally overlooked chunk_sectors, and
it sounds like that's exactly what we want. I'll send another patch
where that becomes writable.


On Mon, Jun 15, 2020 at 7:40 PM Ming Lei <ming.lei@redhat.com> wrote:
>
> On Mon, Jun 15, 2020 at 05:56:33PM -0700, Harshad Shirwadkar wrote:
> > [commit message snipped]
>
> Such alignment can be reached via q->limits.chunk_sectors, and you
> just need to expose it via sysfs and make it writable.
>
> Thanks,
> Ming
>
harshad shirwadkar June 16, 2020, 3:24 a.m. UTC | #5
After taking a closer look, I don't see chunk_sectors being accounted
for in the splitting code. If I understand correctly, the implementation
of chunk_sectors is such that it allows for one big IO to go through.
In other words, it tries to avoid the splitting code if possible. But,
it doesn't seem to guarantee that when an IO is split, it will be
split at the configured alignment. With commit
07173c3ec276cbb18dc0e0687d37d310e98a1480 ("block: enable multipage
bvecs"), we now see bigger multi-page IOs. So, if an app sent out
multiple writes, those would all get combined into one big IO.
"chunk_sectors" would allow that one big IO to go through *ONLY IF*
the total size is < max_sectors. If that's not the case, chunk_sectors
doesn't seem to have control over how this big IO will now be split.
Please correct me if I'm wrong.


On Mon, Jun 15, 2020 at 7:50 PM harshad shirwadkar
<harshadshirwadkar@gmail.com> wrote:
>
> Thanks for the feedback, everyone. I totally overlooked chunk_sectors, and
> it sounds like that's exactly what we want. I'll send another patch
> where that becomes writable.
>
> [rest of quote snipped]
Ming Lei June 16, 2020, 4:16 a.m. UTC | #6
On Mon, Jun 15, 2020 at 08:24:08PM -0700, harshad shirwadkar wrote:
> After taking a closer look, I don't see chunk_sectors being accounted
> for in the splitting code. If I understand correctly, the implementation
> of chunk_sectors is such that it allows for one big IO to go through.
> In other words, it tries to avoid the splitting code if possible. But,
> it doesn't seem to guarantee that when an IO is split, it will be

Please take a look at blk_max_size_offset(), which is called by
get_max_io_size() from blk_bio_segment_split(), and which splits the
bio into chunk_sectors-aligned bios.

Also, you may run an IO trace on a queue with chunk_sectors set up, and
check whether the split IO requests are what you expected.


Thanks, 
Ming
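
For reference, the helper Ming points to looked roughly like this in
kernels of that era (a paraphrased sketch, not an exact copy of
include/linux/blkdev.h). The returned limit shrinks as the offset
approaches the next chunk_sectors boundary, which is what forces splits
to land on chunk-aligned offsets:

    static inline unsigned int blk_max_size_offset(struct request_queue *q,
                                                   sector_t offset)
    {
            /* chunk_sectors == 0 means no chunk boundary constraint */
            if (!q->limits.chunk_sectors)
                    return q->limits.max_sectors;

            /*
             * chunk_sectors is a power of 2, so the mask yields the offset
             * within the current chunk; the difference is the number of
             * sectors remaining before the next chunk boundary.
             */
            return min(q->limits.max_sectors, (unsigned int)
                       (q->limits.chunk_sectors -
                        (offset & (q->limits.chunk_sectors - 1))));
    }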
Christoph Hellwig June 16, 2020, 7:29 a.m. UTC | #7
On Mon, Jun 15, 2020 at 05:56:33PM -0700, Harshad Shirwadkar wrote:
> This feature allows the user to control the alignment at which the
> request queue is allowed to split bios. Google CloudSQL's 16k
> user-space application expects that direct IO writes aligned at a 16k
> boundary in user-space are not split by the kernel at non-16k
> boundaries.

That is a completely bogus assumption, and people who think they can
make that assumption should not be allowed to write software.

Can we stop this crap now please?
Sweet Tea Dorminy June 16, 2020, 11:12 a.m. UTC | #8
Hi Christoph;

Forgive my newness and my not understanding most subtleties, but I
don't actually understand why you say "that is a completely bogus
assumption" -- could you please elaborate or point me at a
document/code/list archive explaining?

Thanks in advance!

John

On Tue, Jun 16, 2020 at 3:30 AM Christoph Hellwig <hch@infradead.org> wrote:
>
> On Mon, Jun 15, 2020 at 05:56:33PM -0700, Harshad Shirwadkar wrote:
> > This feature allows the user to control the alignment at which the
> > request queue is allowed to split bios. Google CloudSQL's 16k
> > user-space application expects that direct IO writes aligned at a 16k
> > boundary in user-space are not split by the kernel at non-16k
> > boundaries.
>
> That is a completely bogus assumption, and people who think they can
> make that assumption should not be allowed to write software.
>
> Can we stop this crap now please?
>
Christoph Hellwig June 16, 2020, 2:38 p.m. UTC | #9
On Tue, Jun 16, 2020 at 07:12:48AM -0400, Sweet Tea Dorminy wrote:
> Hi Christoph;
> 
> Forgive my newness and my not understanding most subtleties, but I
> don't actually understand why you say "that is a completely bogus
> assumption" -- could you please elaborate or point me at a
> document/code/list archive explaining?

There is absolutely no guarantee in what way the Linux kernel splits
and merges I/O requests.  Even more importantly, there are absolutely
no guarantees about on-disk atomicity in the face of I/O failures.

Patch

diff --git a/Documentation/block/queue-sysfs.rst b/Documentation/block/queue-sysfs.rst
index 6a8513af9201..c3eaba149415 100644
--- a/Documentation/block/queue-sysfs.rst
+++ b/Documentation/block/queue-sysfs.rst
@@ -251,4 +251,12 @@  devices are described in the ZBC (Zoned Block Commands) and ZAC
 do not support zone commands, they will be treated as regular block devices
 and zoned will report "none".
 
+split_alignment (RW)
+----------------------
+This is the alignment in bytes at which the requeust queue is allowed
+to split IO requests. Once this value is set, the requeust queue
+splits IOs such that the individual IOs are aligned to
+split_alignment. The value of 0 indicates that an IO request can be
+split anywhere. This value must be a power of 2.
+
 Jens Axboe <jens.axboe@oracle.com>, February 2009
diff --git a/block/blk-merge.c b/block/blk-merge.c
index 1534ed736363..cdf337c74b83 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -105,15 +105,18 @@  static struct bio *blk_bio_discard_split(struct request_queue *q,
 static struct bio *blk_bio_write_zeroes_split(struct request_queue *q,
 		struct bio *bio, struct bio_set *bs, unsigned *nsegs)
 {
+	sector_t split;
+
 	*nsegs = 0;
 
-	if (!q->limits.max_write_zeroes_sectors)
-		return NULL;
+	split = q->limits.max_write_zeroes_sectors;
+	if (split && q->split_alignment >> 9)
+		split = round_down(split, q->split_alignment >> 9);
 
-	if (bio_sectors(bio) <= q->limits.max_write_zeroes_sectors)
+	if (!split || bio_sectors(bio) <= split)
 		return NULL;
 
-	return bio_split(bio, q->limits.max_write_zeroes_sectors, GFP_NOIO, bs);
+	return bio_split(bio, split, GFP_NOIO, bs);
 }
 
 static struct bio *blk_bio_write_same_split(struct request_queue *q,
@@ -121,15 +124,18 @@  static struct bio *blk_bio_write_same_split(struct request_queue *q,
 					    struct bio_set *bs,
 					    unsigned *nsegs)
 {
+	sector_t split;
+
 	*nsegs = 1;
 
-	if (!q->limits.max_write_same_sectors)
-		return NULL;
+	split = q->limits.max_write_same_sectors;
+	if (split && q->split_alignment >> 9)
+		split = round_down(split, q->split_alignment >> 9);
 
-	if (bio_sectors(bio) <= q->limits.max_write_same_sectors)
+	if (!split || bio_sectors(bio) <= split)
 		return NULL;
 
-	return bio_split(bio, q->limits.max_write_same_sectors, GFP_NOIO, bs);
+	return bio_split(bio, split, GFP_NOIO, bs);
 }
 
 /*
@@ -248,7 +254,10 @@  static struct bio *blk_bio_segment_split(struct request_queue *q,
 {
 	struct bio_vec bv, bvprv, *bvprvp = NULL;
 	struct bvec_iter iter;
-	unsigned nsegs = 0, sectors = 0;
+	unsigned int nsegs = 0, nsegs_aligned = 0;
+	unsigned int sectors = 0, sectors_aligned = 0, before = 0, after = 0;
+	unsigned int sector_alignment =
+		q->split_alignment ? (q->split_alignment >> 9) : 0;
 	const unsigned max_sectors = get_max_io_size(q, bio);
 	const unsigned max_segs = queue_max_segments(q);
 
@@ -264,12 +273,31 @@  static struct bio *blk_bio_segment_split(struct request_queue *q,
 		    sectors + (bv.bv_len >> 9) <= max_sectors &&
 		    bv.bv_offset + bv.bv_len <= PAGE_SIZE) {
 			nsegs++;
-			sectors += bv.bv_len >> 9;
-		} else if (bvec_split_segs(q, &bv, &nsegs, &sectors, max_segs,
-					 max_sectors)) {
-			goto split;
+			before = round_down(sectors, sector_alignment);
+			sectors += (bv.bv_len >> 9);
+			after = round_down(sectors, sector_alignment);
+			if (sector_alignment && before != after) {
+				/* This is a valid split point */
+				nsegs_aligned = nsegs;
+				sectors_aligned = after;
+			}
+			goto next;
 		}
-
+		if (sector_alignment) {
+			before = round_down(sectors, sector_alignment);
+			after = round_down(sectors + (bv.bv_len >> 9),
+					  sector_alignment);
+			if ((nsegs < max_segs) && before != after &&
+			    ((after - before) << 9) + bv.bv_offset <=  PAGE_SIZE
+			    && after <= max_sectors) {
+				sectors_aligned = after;
+				nsegs_aligned = nsegs + 1;
+			}
+		}
+		if (bvec_split_segs(q, &bv, &nsegs, &sectors, max_segs,
+				    max_sectors))
+			goto split;
+next:
 		bvprv = bv;
 		bvprvp = &bvprv;
 	}
@@ -278,7 +306,13 @@  static struct bio *blk_bio_segment_split(struct request_queue *q,
 	return NULL;
 split:
 	*segs = nsegs;
-	return bio_split(bio, sectors, GFP_NOIO, bs);
+	if (sector_alignment && sectors_aligned == 0)
+		return NULL;
+
+	*segs = sector_alignment ? nsegs_aligned : nsegs;
+
+	return bio_split(bio, sector_alignment ? sectors_aligned : sectors,
+			 GFP_NOIO, bs);
 }
 
 /**
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index fca9b158f4a0..f045c7a79a74 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -529,6 +529,29 @@  static ssize_t queue_dax_show(struct request_queue *q, char *page)
 	return queue_var_show(blk_queue_dax(q), page);
 }
 
+static ssize_t queue_split_alignment_show(struct request_queue *q, char *page)
+{
+	return queue_var_show(q->split_alignment, page);
+}
+
+static ssize_t queue_split_alignment_store(struct request_queue *q, const char *page,
+						size_t count)
+{
+	unsigned long split_alignment;
+	int ret;
+
+	ret = queue_var_store(&split_alignment, page, count);
+	if (ret < 0)
+		return ret;
+
+	/* split_alignment can only be a power of 2 */
+	if (split_alignment & (split_alignment - 1))
+		return -EINVAL;
+
+	q->split_alignment = split_alignment;
+	return count;
+}
+
 static struct queue_sysfs_entry queue_requests_entry = {
 	.attr = {.name = "nr_requests", .mode = 0644 },
 	.show = queue_requests_show,
@@ -727,6 +750,12 @@  static struct queue_sysfs_entry throtl_sample_time_entry = {
 };
 #endif
 
+static struct queue_sysfs_entry queue_split_alignment = {
+	.attr = {.name = "split_alignment", .mode = 0644 },
+	.show = queue_split_alignment_show,
+	.store = queue_split_alignment_store,
+};
+
 static struct attribute *queue_attrs[] = {
 	&queue_requests_entry.attr,
 	&queue_ra_entry.attr,
@@ -766,6 +795,7 @@  static struct attribute *queue_attrs[] = {
 #ifdef CONFIG_BLK_DEV_THROTTLING_LOW
 	&throtl_sample_time_entry.attr,
 #endif
+	&queue_split_alignment.attr,
 	NULL,
 };
 
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 32868fbedc9e..e8feb43f6fdd 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -473,6 +473,7 @@  struct request_queue {
 	void			*dma_drain_buffer;
 	unsigned int		dma_pad_mask;
 	unsigned int		dma_alignment;
+	unsigned int		split_alignment;
 
 	unsigned int		rq_timeout;
 	int			poll_nsec;