
[01/10] block: don't decrement nr_phys_segments for physically contiguous segments

Message ID 20190513063754.1520-2-hch@lst.de (mailing list archive)
State New, archived
Series [01/10] block: don't decrement nr_phys_segments for physically contiguous segments

Commit Message

Christoph Hellwig May 13, 2019, 6:37 a.m. UTC
Currently ll_merge_requests_fn, unlike all other merge functions,
reduces nr_phys_segments by one if the last segment of the previous
and the first segment of the next request are contiguous.  While this
seems like a nice solution to avoid building smaller-than-possible
requests, it causes a mismatch between the segments actually present
in the request and those iterated over by the bvec iterators, including
__rq_for_each_bio.  This could cause overwrites of too small kmalloc
allocations in any driver using ranged discard, or mistrigger
the single segment optimization in the nvme-pci driver.

We could possibly work around this by making the bvec iterators take
the front and back segment size into account, but that would require
moving them from the bio to the bio_iter and spreading this mess
over all users of bvecs.  Or we could simply remove this optimization
under the assumption that most users already build good enough bvecs,
and that the bio merge path never cared about this optimization
either.  The latter is what this patch does.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 block/blk-merge.c | 23 +----------------------
 1 file changed, 1 insertion(+), 22 deletions(-)
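
To make the failure mode concrete, here is a hypothetical driver sketch
(not code from this series; struct discard_range and the function name
are made up for illustration):

/*
 * Sketch of the ranged-discard pattern that breaks: the array is sized
 * from the request's segment count but filled per bio.  If
 * nr_phys_segments was decremented for a physically contiguous
 * boundary, the request contains more bios than 'segments', and the
 * loop writes past the end of the kmalloc'ed allocation.
 */
static int setup_discard_ranges(struct request *req)
{
        unsigned short segments = blk_rq_nr_discard_segments(req);
        unsigned short n = 0;
        struct discard_range *range;    /* made-up range descriptor */
        struct bio *bio;

        range = kmalloc_array(segments, sizeof(*range), GFP_ATOMIC);
        if (!range)
                return -ENOMEM;

        __rq_for_each_bio(bio, req) {
                range[n].sector = bio->bi_iter.bi_sector;
                range[n].nr_sectors = bio->bi_iter.bi_size >> SECTOR_SHIFT;
                n++;    /* can exceed 'segments' when the count is short */
        }

        /* ... hand 'range' to the hardware, then free it ... */
        kfree(range);
        return 0;
}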

Comments

Ming Lei May 13, 2019, 9:45 a.m. UTC | #1
On Mon, May 13, 2019 at 08:37:45AM +0200, Christoph Hellwig wrote:
> Currently ll_merge_requests_fn, unlike all other merge functions,
> reduces nr_phys_segments by one if the last segment of the previous
> and the first segment of the next request are contiguous.  While this
> seems like a nice solution to avoid building smaller-than-possible

Some workloads need this optimization; please see commit 729204ef49ec00b
("block: relax check on sg gap"):

    If the last bvec of the 1st bio and the 1st bvec of the next
    bio are physically contiguous, and the latter can be merged
    into the last segment of the 1st bio, we should think they don't
    violate the sg gap (or virt boundary) limit.

    Both Vitaly and Dexuan reported that lots of unmergeable small bios
    are observed when running mkfs on Hyper-V virtual storage, and
    performance becomes quite low. This patch fixes that performance
    issue.

Thousands of 512-byte bios can be observed in one request when
running mkfs-related workloads.
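
For reference, the contiguity test that optimization relies on looks
roughly like this (a simplified sketch of the core of
biovec_phys_mergeable(); the helper name is made up, and the real
function additionally checks the queue's segment boundary mask):

/*
 * Two bvecs can share one physical segment when the first ends exactly
 * where the second begins in physical memory.
 */
static bool bvecs_phys_contiguous(const struct bio_vec *prv,
                                  const struct bio_vec *nxt)
{
        phys_addr_t prv_end = page_to_phys(prv->bv_page) +
                                prv->bv_offset + prv->bv_len;
        phys_addr_t nxt_start = page_to_phys(nxt->bv_page) +
                                nxt->bv_offset;

        return prv_end == nxt_start;
}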

> requests, it causes a mismatch between the segments actually present
> in the request and those iterated over by the bvec iterators, including
> __rq_for_each_bio.  This could cause overwrites of too small kmalloc

Request-based drivers usually shouldn't iterate over bios any more.


Thanks,
Ming
Christoph Hellwig May 13, 2019, 12:03 p.m. UTC | #2
On Mon, May 13, 2019 at 05:45:45PM +0800, Ming Lei wrote:
> On Mon, May 13, 2019 at 08:37:45AM +0200, Christoph Hellwig wrote:
> > Currently ll_merge_requests_fn, unlike all other merge functions,
> > reduces nr_phys_segments by one if the last segment of the previous
> > and the first segment of the next request are contiguous.  While this
> > seems like a nice solution to avoid building smaller-than-possible
> 
> > Some workloads need this optimization; please see commit 729204ef49ec00b
> ("block: relax check on sg gap"):

And we still allow merging the segments with this patch.  The only
difference is that these merges do get accounted as extra segments.
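
A rough illustration of that accounting difference, assuming two
single-segment requests whose boundary bvecs touch in physical memory:

/*
 * req:  nr_phys_segments = 1   (bvec A ends at physical address P)
 * next: nr_phys_segments = 1   (bvec B starts at physical address P)
 *
 * before this patch: merged nr_phys_segments = 1 + 1 - 1 = 1
 * after this patch:  merged nr_phys_segments = 1 + 1     = 2
 *
 * The DMA mapping code may still coalesce A and B into a single
 * scatterlist entry; the request just no longer claims fewer segments
 * than its bvec iterators will produce.
 */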

> > requests, it causes a mismatch between the segments actually present
> > in the request and those iterated over by the bvec iterators, including
> > __rq_for_each_bio.  This could cause overwrites of too small kmalloc
> 
> Request-based drivers usually shouldn't iterate over bios any more.

We do that in a couple of places.  For one, the nvme single segment
optimization that triggered this bug.  Also for range discard support
in nvme and virtio.  Then we have loop, which iterates the segments but
doesn't use the nr_phys_segments count, and plenty of others that
iterate over pages at the moment but should be iterating bvecs,
e.g. ubd or aoe.
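
The nvme fast path mentioned above keys off the segment count roughly
like this (a sketch only; map_single_bvec() and map_sgl() are
placeholder names, not real driver functions):

/*
 * If the request claims exactly one physical segment, map the lone
 * bvec directly and skip building a scatterlist.  An undercounted
 * nr_phys_segments makes this path map only part of the data.
 */
static blk_status_t map_data(struct request *req)
{
        if (blk_rq_nr_phys_segments(req) == 1) {
                struct bio_vec bv = req_bvec(req);

                return map_single_bvec(req, &bv);
        }

        return map_sgl(req);
}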
Christoph Hellwig May 13, 2019, 12:37 p.m. UTC | #3
On Mon, May 13, 2019 at 02:03:44PM +0200, Christoph Hellwig wrote:
> On Mon, May 13, 2019 at 05:45:45PM +0800, Ming Lei wrote:
> > On Mon, May 13, 2019 at 08:37:45AM +0200, Christoph Hellwig wrote:
> > > Currently ll_merge_requests_fn, unlike all other merge functions,
> > > reduces nr_phys_segments by one if the last segment of the previous
> > > and the first segment of the next request are contiguous.  While this
> > > seems like a nice solution to avoid building smaller-than-possible
> > 
> > Some workloads need this optimization; please see commit 729204ef49ec00b
> > ("block: relax check on sg gap"):
> 
> And we still allow merging the segments with this patch.  The only
> difference is that these merges do get accounted as extra segments.

Trying mkfs.xfs, there were no merges at all, which is expected as
it does perfectly sized direct I/O.

Trying mkfs.ext4, I see lots of bio merges.  But for those this patch
changes nothing at all, as we never decrement nr_phys_segments for bio
merges to start with; we only ever did that for request merges, and
even then only on non-gappy devices due to the way the
req_gap_back_merge check was placed.
Ming Lei May 14, 2019, 4:36 a.m. UTC | #4
On Mon, May 13, 2019 at 02:03:44PM +0200, Christoph Hellwig wrote:
> On Mon, May 13, 2019 at 05:45:45PM +0800, Ming Lei wrote:
> > On Mon, May 13, 2019 at 08:37:45AM +0200, Christoph Hellwig wrote:
> > > Currently ll_merge_requests_fn, unlike all other merge functions,
> > > reduces nr_phys_segments by one if the last segment of the previous
> > > and the first segment of the next request are contiguous.  While this
> > > seems like a nice solution to avoid building smaller-than-possible
> > 
> > Some workloads need this optimization; please see commit 729204ef49ec00b
> > ("block: relax check on sg gap"):
> 
> And we still allow merging the segments with this patch.  The only
> difference is that these merges do get accounted as extra segments.

It is easy for .nr_phys_segments to reach the max segment limit this
way, and then no new bio can be merged any more.

We don't consider segment merges between two bios in ll_new_hw_segment();
in my mkfs test over virtio-blk, request size can easily be increased to
~1M (several segments) from 63k (126 bios/segments) if the segment merge
between two bios is considered.

> 
> > > requests, it causes a mismatch between the segments actually present
> > > in the request and those iterated over by the bvec iterators, including
> > > __rq_for_each_bio.  This could cause overwrites of too small kmalloc
> > 
> > Request-based drivers usually shouldn't iterate over bios any more.
> 
> We do that in a couple of places.  For one, the nvme single segment
> optimization that triggered this bug.  Also for range discard support
> in nvme and virtio.  Then we have loop, which iterates the segments but
> doesn't use the nr_phys_segments count, and plenty of others that
> iterate over pages at the moment but should be iterating bvecs,
> e.g. ubd or aoe.

It seems discard segments don't consider bio merges for nvme and virtio,
so it should be fine this way.  I will take a closer look at nvme/virtio
discard segment merging later.


Thanks,
Ming
Christoph Hellwig May 14, 2019, 5:14 a.m. UTC | #5
On Tue, May 14, 2019 at 12:36:43PM +0800, Ming Lei wrote:
> > > Some workloads need this optimization; please see commit 729204ef49ec00b
> > > ("block: relax check on sg gap"):
> > 
> > And we still allow merging the segments with this patch.  The only
> > difference is that these merges do get accounted as extra segments.
> 
> It is easy for .nr_phys_segments to reach the max segment limit this
> way, and then no new bio can be merged any more.

As said in my other mail, we only decremented it for request merges
in the non-gap case before, and no one complained.

> We don't consider segment merges between two bios in ll_new_hw_segment();
> in my mkfs test over virtio-blk, request size can easily be increased to
> ~1M (several segments) from 63k (126 bios/segments) if the segment merge
> between two bios is considered.

With gap devices we have unlimited segment size; see my next patch,
which actually enforces that.  That is much more efficient than using
multiple segments.  Also, instead of hacking up the merge path even more,
we can fix the block device buffered I/O path to submit large I/Os
instead of relying on merging, like we do in the direct I/O code and every
major file system.  I have that on my plate as a todo list item.

> > We do that in a couple of places.  For one, the nvme single segment
> > optimization that triggered this bug.  Also for range discard support
> > in nvme and virtio.  Then we have loop, which iterates the segments but
> > doesn't use the nr_phys_segments count, and plenty of others that
> > iterate over pages at the moment but should be iterating bvecs,
> > e.g. ubd or aoe.
> 
> It seems discard segments don't consider bio merges for nvme and virtio,
> so it should be fine this way.  I will take a closer look at nvme/virtio
> discard segment merging later.

I found the bio case by looking at doing the proper accounting in the
bio merge path and hitting a KASAN warning due to the range kmalloc.
So that issue is real as well.
Ming Lei May 14, 2019, 9:05 a.m. UTC | #6
On Tue, May 14, 2019 at 07:14:41AM +0200, Christoph Hellwig wrote:
> On Tue, May 14, 2019 at 12:36:43PM +0800, Ming Lei wrote:
> > > > Some workloads need this optimization; please see commit 729204ef49ec00b
> > > > ("block: relax check on sg gap"):
> > > 
> > > And we still allow merging the segments with this patch.  The only
> > > difference is that these merges do get accounted as extra segments.
> > 
> > It is easy for .nr_phys_segments to reach the max segment limit this
> > way, and then no new bio can be merged any more.
> 
> As said in my other mail, we only decremented it for request merges
> in the non-gap case before, and no one complained.

However, we may still make it better; for example, the attached patch can
save 10~20% of the time when running 'mkfs.ntfs -s 512 /dev/vda', as lots
of small requests (63KB) can be merged into big I/Os (1MB).

> 
> > We don't consider segment merges between two bios in ll_new_hw_segment();
> > in my mkfs test over virtio-blk, request size can easily be increased to
> > ~1M (several segments) from 63k (126 bios/segments) if the segment merge
> > between two bios is considered.
> 
> With gap devices we have unlimited segment size; see my next patch,
> which actually enforces that.  That is much more efficient than using

But this patch does have an effect on non-gap devices, and actually most
devices are the non-gap type.

> multiple segments.  Also, instead of hacking up the merge path even more,
> we can fix the block device buffered I/O path to submit large I/Os
> instead of relying on merging, like we do in the direct I/O code and every
> major file system.  I have that on my plate as a todo list item.
> 
> > > We do that in a couple of places.  For one, the nvme single segment
> > > optimization that triggered this bug.  Also for range discard support
> > > in nvme and virtio.  Then we have loop, which iterates the segments but
> > > doesn't use the nr_phys_segments count, and plenty of others that
> > > iterate over pages at the moment but should be iterating bvecs,
> > > e.g. ubd or aoe.
> > 
> > It seems discard segments don't consider bio merges for nvme and virtio,
> > so it should be fine this way.  I will take a closer look at nvme/virtio
> > discard segment merging later.
> 
> I found the bio case by looking at doing the proper accounting in the
> bio merge path and hitting a KASAN warning due to the range kmalloc.
> So that issue is real as well.

I guess the following patch may fix it:

diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
index f1d90cd3dc47..a913110de389 100644
--- a/drivers/block/virtio_blk.c
+++ b/drivers/block/virtio_blk.c
@@ -175,7 +175,7 @@ static int virtblk_add_req(struct virtqueue *vq, struct virtblk_req *vbr,
 
 static int virtblk_setup_discard_write_zeroes(struct request *req, bool unmap)
 {
-	unsigned short segments = blk_rq_nr_discard_segments(req);
+	unsigned short segments = 0;
 	unsigned short n = 0;
 	struct virtio_blk_discard_write_zeroes *range;
 	struct bio *bio;
@@ -184,6 +184,9 @@ static int virtblk_setup_discard_write_zeroes(struct request *req, bool unmap)
 	if (unmap)
 		flags |= VIRTIO_BLK_WRITE_ZEROES_FLAG_UNMAP;
 
+	__rq_for_each_bio(bio, req)
+		segments++;
+
 	range = kmalloc_array(segments, sizeof(*range), GFP_ATOMIC);
 	if (!range)
 		return -ENOMEM;


Thanks,
Ming
diff --git a/block/blk-merge.c b/block/blk-merge.c
index 21e87a714a73..af122efeb28d 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -622,6 +622,8 @@ static inline int ll_new_hw_segment(struct request_queue *q,
 				    struct bio *bio)
 {
 	int nr_phys_segs = bio_phys_segments(q, bio);
+	unsigned int seg_size =
+		req->biotail->bi_seg_back_size + bio->bi_seg_front_size;
 
 	if (req->nr_phys_segments + nr_phys_segs > queue_max_segments(q))
 		goto no_merge;
@@ -629,11 +631,15 @@ static inline int ll_new_hw_segment(struct request_queue *q,
 	if (blk_integrity_merge_bio(q, req, bio) == false)
 		goto no_merge;
 
-	/*
-	 * This will form the start of a new hw segment.  Bump both
-	 * counters.
-	 */
-	req->nr_phys_segments += nr_phys_segs;
+	if (blk_phys_contig_segment(q, req->biotail, bio)) {
+		if (nr_phys_segs)
+			req->nr_phys_segments += nr_phys_segs - 1;
+		if (nr_phys_segs == 1)
+			bio->bi_seg_back_size = seg_size;
+		if (req->nr_phys_segments == 1)
+			req->bio->bi_seg_front_size = seg_size;
+	} else
+		req->nr_phys_segments += nr_phys_segs;
 	return 1;
 
 no_merge:
Christoph Hellwig May 14, 2019, 1:51 p.m. UTC | #7
On Tue, May 14, 2019 at 05:05:45PM +0800, Ming Lei wrote:
> However, we may still make it better; for example, the attached patch can
> save 10~20% of the time when running 'mkfs.ntfs -s 512 /dev/vda', as lots
> of small requests (63KB) can be merged into big I/Os (1MB).

And we could save even more time by making the block dev buffered I/O
path not do stupid things to start with.

> > With gap devices we have unlimited segment size; see my next patch,
> > which actually enforces that.  That is much more efficient than using
> 
> But this patch does have an effect on non-gap devices, and actually most
> devices are the non-gap type.

Yes, but only for request merges, and only if merging the requests
goes over the max segments limit.  The upside is that we actually get a
nr_phys_segments that mirrors what is in the request, fixing bugs
in a few drivers, and allowing for follow-on patches that significantly
simplify our I/O path.
Hannes Reinecke May 14, 2019, 1:57 p.m. UTC | #8
On 5/14/19 3:51 PM, Christoph Hellwig wrote:
> On Tue, May 14, 2019 at 05:05:45PM +0800, Ming Lei wrote:
>> However, we may still make it better; for example, the attached patch can
>> save 10~20% of the time when running 'mkfs.ntfs -s 512 /dev/vda', as lots
>> of small requests (63KB) can be merged into big I/Os (1MB).
> 
> And we could save even more time by making the block dev buffered I/O
> path not do stupid things to start with.
> 
>>> With gap devices we have unlimited segment size; see my next patch,
>>> which actually enforces that.  That is much more efficient than using
>>
>> But this patch does have an effect on non-gap devices, and actually most
>> devices are the non-gap type.
> 
> Yes, but only for request merges, and only if merging the requests
> goes over the max segments limit.  The upside is that we actually get a
> nr_phys_segments that mirrors what is in the request, fixing bugs
> in a few drivers, and allowing for follow-on patches that significantly
> simplify our I/O path.
> 
And we've seen several of these crashes in real life; the signature is
'Kernel oops at drivers/scsi/scsi_lib.c:1003'.
So I'm really glad that this one is finally being addressed.

Thanks Christoph!

Cheers,

Hannes
Ming Lei May 14, 2019, 2:27 p.m. UTC | #9
On Tue, May 14, 2019 at 03:51:42PM +0200, Christoph Hellwig wrote:
> On Tue, May 14, 2019 at 05:05:45PM +0800, Ming Lei wrote:
> > However, we may still make it better; for example, the attached patch can
> > save 10~20% of the time when running 'mkfs.ntfs -s 512 /dev/vda', as lots
> > of small requests (63KB) can be merged into big I/Os (1MB).
> 
> And we could save even more time by making the block dev buffered I/O
> path not do stupid things to start with.

I am wondering if it can be done easily, given that mkfs is userspace
code that only calls the write syscall on the block device.  Or could
you share something about how to fix the stupid things?

> 
> > > With gap devices we have unlimited segment size; see my next patch,
> > > which actually enforces that.  That is much more efficient than using
> > 
> > But this patch does have an effect on non-gap devices, and actually most
> > devices are the non-gap type.
> 
> Yes, but only for request merges, and only if merging the requests
> goes over the max segments limit.  The upside is that we actually get a
> nr_phys_segments that mirrors what is in the request, fixing bugs
> in a few drivers, and allowing for follow-on patches that significantly
> simplify our I/O path.

Non-gap devices still have a max segment size limit, and I guess it
still needs to be respected.


Thanks,
Ming
Christoph Hellwig May 14, 2019, 2:31 p.m. UTC | #10
On Tue, May 14, 2019 at 10:27:17PM +0800, Ming Lei wrote:
> I am wondering if it can be done easily, given that mkfs is userspace
> code that only calls the write syscall on the block device.  Or could
> you share something about how to fix the stupid things?

mkfs.ext4 at least uses buffered I/O on the block device.  And the
block device uses the really old buffer_head-based address_space ops,
which will submit one bio per buffer_head, that is, per logical block.
mkfs probably writes much larger sizes than that.

> > nr_phys_segments that mirrors what is in the request, fixing bugs
> > in a few drivers, and allowing for follow-on patches that significantly
> > simplify our I/O path.
> 
> Non-gap devices still have a max segment size limit, and I guess it
> still needs to be respected.

It is still respected.  It might just disallow some merges we could
theoretically allow.
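
Concretely, per-bvec segment accounting enforces the size limit at
split time; a rough sketch of the idea behind bvec_split_segs()
(simplified, not its exact code, which also honors offsets and
boundary masks):

/*
 * A single large bvec is accounted as multiple physical segments so
 * that no segment exceeds queue_max_segment_size(q).
 */
static unsigned int count_segs(struct request_queue *q, struct bio_vec *bv)
{
        return DIV_ROUND_UP(bv->bv_len, queue_max_segment_size(q));
}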
Christoph Hellwig May 14, 2019, 2:32 p.m. UTC | #11
On Tue, May 14, 2019 at 04:31:02PM +0200, Christoph Hellwig wrote:
> On Tue, May 14, 2019 at 10:27:17PM +0800, Ming Lei wrote:
> > I am wondering if it can be done easily, given that mkfs is userspace
> > code that only calls the write syscall on the block device.  Or could
> > you share something about how to fix the stupid things?
> 
> mkfs.ext4 at least uses buffered I/O on the block device.  And the
> block device uses the really old buffer_head-based address_space ops,
> which will submit one bio per buffer_head, that is, per logical block.
> mkfs probably writes much larger sizes than that.

As a first step we could try something like the patch below, although
the mpage ops still aren't exactly optimal:


diff --git a/fs/block_dev.c b/fs/block_dev.c
index bded2ee3788d..b2ee74f1c669 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -608,12 +608,12 @@ EXPORT_SYMBOL(thaw_bdev);
 
 static int blkdev_writepage(struct page *page, struct writeback_control *wbc)
 {
-	return block_write_full_page(page, blkdev_get_block, wbc);
+	return mpage_writepage(page, blkdev_get_block, wbc);
 }
 
 static int blkdev_readpage(struct file * file, struct page * page)
 {
-	return block_read_full_page(page, blkdev_get_block);
+	return mpage_readpage(page, blkdev_get_block);
 }
 
 static int blkdev_readpages(struct file *file, struct address_space *mapping,
@@ -1984,7 +1984,7 @@ static int blkdev_releasepage(struct page *page, gfp_t wait)
 static int blkdev_writepages(struct address_space *mapping,
 			     struct writeback_control *wbc)
 {
-	return generic_writepages(mapping, wbc);
+	return mpage_writepages(mapping, wbc, blkdev_get_block);
 }
 
 static const struct address_space_operations def_blk_aops = {

Patch

diff --git a/block/blk-merge.c b/block/blk-merge.c
index 21e87a714a73..80a5a0facb87 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -358,7 +358,6 @@  static unsigned int __blk_recalc_rq_segments(struct request_queue *q,
 	unsigned front_seg_size;
 	struct bio *fbio, *bbio;
 	struct bvec_iter iter;
-	bool new_bio = false;
 
 	if (!bio)
 		return 0;
@@ -379,31 +378,12 @@  static unsigned int __blk_recalc_rq_segments(struct request_queue *q,
 	nr_phys_segs = 0;
 	for_each_bio(bio) {
 		bio_for_each_bvec(bv, bio, iter) {
-			if (new_bio) {
-				if (seg_size + bv.bv_len
-				    > queue_max_segment_size(q))
-					goto new_segment;
-				if (!biovec_phys_mergeable(q, &bvprv, &bv))
-					goto new_segment;
-
-				seg_size += bv.bv_len;
-
-				if (nr_phys_segs == 1 && seg_size >
-						front_seg_size)
-					front_seg_size = seg_size;
-
-				continue;
-			}
-new_segment:
 			bvec_split_segs(q, &bv, &nr_phys_segs, &seg_size,
 					&front_seg_size, NULL, UINT_MAX);
-			new_bio = false;
 		}
 		bbio = bio;
-		if (likely(bio->bi_iter.bi_size)) {
+		if (likely(bio->bi_iter.bi_size))
 			bvprv = bv;
-			new_bio = true;
-		}
 	}
 
 	fbio->bi_seg_front_size = front_seg_size;
@@ -725,7 +705,6 @@  static int ll_merge_requests_fn(struct request_queue *q, struct request *req,
 			req->bio->bi_seg_front_size = seg_size;
 		if (next->nr_phys_segments == 1)
 			next->biotail->bi_seg_back_size = seg_size;
-		total_phys_segments--;
 	}
 
 	if (total_phys_segments > queue_max_segments(q))