Message ID | 20190513063754.1520-2-hch@lst.de (mailing list archive)
---|---
State | New, archived
Series | [01/10] block: don't decrement nr_phys_segments for physically contigous segments
On Mon, May 13, 2019 at 08:37:45AM +0200, Christoph Hellwig wrote:
> Currently ll_merge_requests_fn, unlike all other merge functions,
> reduces nr_phys_segments by one if the last segment of the previous,
> and the first segment of the next segement are contigous. While this
> seems like a nice solution to avoid building smaller than possible

Some workloads need this optimization, please see 729204ef49ec00b
("block: relax check on sg gap"):

    If the last bvec of the 1st bio and the 1st bvec of the next bio
    are physically contiguous, and the latter can be merged into the
    last segment of the 1st bio, we should consider that they don't
    violate the sg gap (or virt boundary) limit.

    Both Vitaly and Dexuan reported that lots of unmergeable small
    bios are observed when running mkfs on Hyper-V virtual storage,
    and performance becomes quite low. This patch fixes that
    performance issue.

Thousands of 512-byte bios can be observed in one request when
running mkfs-related workloads.

> requests it causes a mismatch between the segments actually present
> in the request and those iterated over by the bvec iterators, including
> __rq_for_each_bio. This could cause overwrites of too small kmalloc

Request-based drivers usually shouldn't iterate bios any more.

Thanks,
Ming
On Mon, May 13, 2019 at 05:45:45PM +0800, Ming Lei wrote:
> On Mon, May 13, 2019 at 08:37:45AM +0200, Christoph Hellwig wrote:
> > Currently ll_merge_requests_fn, unlike all other merge functions,
> > reduces nr_phys_segments by one if the last segment of the previous,
> > and the first segment of the next segement are contigous. While this
> > seems like a nice solution to avoid building smaller than possible
>
> Some workloads need this optimization, please see 729204ef49ec00b
> ("block: relax check on sg gap"):

And we still allow merging the segments with this patch.  The only
difference is that these merges do get accounted as extra segments.

> > requests it causes a mismatch between the segments actually present
> > in the request and those iterated over by the bvec iterators, including
> > __rq_for_each_bio. This could cause overwrites of too small kmalloc
>
> Request based drivers usually shouldn't iterate bio any more.

We do that in a couple of places.  For one, the nvme single segment
optimization that triggered this bug.  Also for range discard support
in nvme and virtio.  Then we have loop, which iterates the segments
but doesn't use the nr_phys_segments count, and plenty of others that
iterate over pages at the moment but should be iterating bvecs,
e.g. ubd or aoe.
On Mon, May 13, 2019 at 02:03:44PM +0200, Christoph Hellwig wrote:
> On Mon, May 13, 2019 at 05:45:45PM +0800, Ming Lei wrote:
> > On Mon, May 13, 2019 at 08:37:45AM +0200, Christoph Hellwig wrote:
> > > Currently ll_merge_requests_fn, unlike all other merge functions,
> > > reduces nr_phys_segments by one if the last segment of the previous,
> > > and the first segment of the next segement are contigous. While this
> > > seems like a nice solution to avoid building smaller than possible
> >
> > Some workloads need this optimization, please see 729204ef49ec00b
> > ("block: relax check on sg gap"):
>
> And we still allow to merge the segments with this patch.  The only
> difference is that these merges do get accounted as extra segments.

Trying mkfs.xfs there were no merges at all, which is expected as it
does perfectly sized direct I/O.

Trying mkfs.ext4 I see lots of bio merges.  But for those this patch
changes nothing at all, as we never decrement nr_phys_segments for bio
merges to start with; we only ever did that for request merges, and
even then only on non-gappy devices, due to the way the
req_gap_back_merge check was placed.
On Mon, May 13, 2019 at 02:03:44PM +0200, Christoph Hellwig wrote:
> On Mon, May 13, 2019 at 05:45:45PM +0800, Ming Lei wrote:
> > On Mon, May 13, 2019 at 08:37:45AM +0200, Christoph Hellwig wrote:
> > > Currently ll_merge_requests_fn, unlike all other merge functions,
> > > reduces nr_phys_segments by one if the last segment of the previous,
> > > and the first segment of the next segement are contigous. While this
> > > seems like a nice solution to avoid building smaller than possible
> >
> > Some workloads need this optimization, please see 729204ef49ec00b
> > ("block: relax check on sg gap"):
>
> And we still allow to merge the segments with this patch.  The only
> difference is that these merges do get accounted as extra segments.

It is easy for .nr_phys_segments to reach the max segment limit this
way, and then no new bio can be merged any more.

We don't consider segment merges between two bios in
ll_new_hw_segment().  In my mkfs test over virtio-blk, the request
size can easily be increased to ~1MB (several segments) from 63KB
(126 bios/segments) if segment merging between two bios is
considered.

> > > requests it causes a mismatch between the segments actually present
> > > in the request and those iterated over by the bvec iterators, including
> > > __rq_for_each_bio. This could cause overwrites of too small kmalloc
> >
> > Request based drivers usually shouldn't iterate bio any more.
>
> We do that in a couple of places.  For one the nvme single segment
> optimization that triggered this bug.  Also for range discard support
> in nvme and virtio.  Then we have loop that iterate the segments, but
> doesn't use the nr_phys_segments count, and plenty of others that
> iterate over pages at the moment but should be iterating bvecs,
> e.g. ubd or aoe.

The discard segment code doesn't seem to consider bio merges for nvme
and virtio, so it should be fine in this way.  I will take a closer
look at nvme/virtio discard segment merging later.

Thanks,
Ming
On Tue, May 14, 2019 at 12:36:43PM +0800, Ming Lei wrote:
> > > Some workloads need this optimization, please see 729204ef49ec00b
> > > ("block: relax check on sg gap"):
> >
> > And we still allow to merge the segments with this patch.  The only
> > difference is that these merges do get accounted as extra segments.
>
> It is easy for .nr_phys_segments to reach the max segment limit by this
> way, then no new bio can be merged any more.

As said in my other mail, we only decremented it for request merges in
the non-gap case before, and no one complained.

> We don't consider segment merge between two bios in ll_new_hw_segment(),
> in my mkfs test over virtio-blk, request size can be increased to ~1M(several
> segments) from 63k(126 bios/segments) easily if the segment merge between
> two bios is considered.

With the gap devices we have unlimited segment size, see my next patch
to actually enforce that, which is much more efficient than using
multiple segments.  Also, instead of hacking up the merge path even
more, we can fix the block device buffered I/O path to submit large
I/Os instead of relying on merging, like we do in the direct I/O code
and every major file system.  I have that on my plate as a todo list
item.

> > We do that in a couple of places.  For one the nvme single segment
> > optimization that triggered this bug.  Also for range discard support
> > in nvme and virtio.  Then we have loop that iterate the segments, but
> > doesn't use the nr_phys_segments count, and plenty of others that
> > iterate over pages at the moment but should be iterating bvecs,
> > e.g. ubd or aoe.
>
> Seems discard segment doesn't consider bios merge for nvme and virtio,
> so it should be fine in this way. Will take a close look at nvme/virtio
> discard segment merge later.

I found the bio case by looking at doing the proper accounting in the
bio merge path and hitting a KASAN warning due to the range kmalloc.
So that issue is real as well.
On Tue, May 14, 2019 at 07:14:41AM +0200, Christoph Hellwig wrote:
> On Tue, May 14, 2019 at 12:36:43PM +0800, Ming Lei wrote:
> > > > Some workloads need this optimization, please see 729204ef49ec00b
> > > > ("block: relax check on sg gap"):
> > >
> > > And we still allow to merge the segments with this patch.  The only
> > > difference is that these merges do get accounted as extra segments.
> >
> > It is easy for .nr_phys_segments to reach the max segment limit by this
> > way, then no new bio can be merged any more.
>
> As said in my other mail we only decremented it for request merges
> in the non-gap case before and no one complained.

However, we may still make it better.  For example, the attached patch
can save 10~20% of the time when running 'mkfs.ntfs -s 512 /dev/vda';
lots of small requests (63KB) can be merged into big I/Os (1MB).

> > We don't consider segment merge between two bios in ll_new_hw_segment(),
> > in my mkfs test over virtio-blk, request size can be increased to ~1M(several
> > segments) from 63k(126 bios/segments) easily if the segment merge between
> > two bios is considered.
>
> With the gap devices we have unlimited segment size, see my next patch
> to actually enforce that.  Which is much more efficient than using

But this patch does have an effect on non-gap devices, and actually
most devices are the non-gap type.

> multiple segments.  Also instead of hacking up the merge path even more
> we can fix the block device buffered I/O path to submit large I/Os
> instead of relying on merging like we do in the direct I/O code and every
> major file system.  I have that on my plate as a todo list item.
>
> > > We do that in a couple of places.  For one the nvme single segment
> > > optimization that triggered this bug.  Also for range discard support
> > > in nvme and virtio.  Then we have loop that iterate the segments, but
> > > doesn't use the nr_phys_segments count, and plenty of others that
> > > iterate over pages at the moment but should be iterating bvecs,
> > > e.g. ubd or aoe.
> >
> > Seems discard segment doesn't consider bios merge for nvme and virtio,
> > so it should be fine in this way. Will take a close look at nvme/virtio
> > discard segment merge later.
>
> I found the bio case by looking at doing the proper accounting in the
> bio merge path and hitting KASAN warning due to the range kmalloc.
> So that issue is real as well.

I guess the following patch may fix it:

diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
index f1d90cd3dc47..a913110de389 100644
--- a/drivers/block/virtio_blk.c
+++ b/drivers/block/virtio_blk.c
@@ -175,7 +175,7 @@ static int virtblk_add_req(struct virtqueue *vq, struct virtblk_req *vbr,
 
 static int virtblk_setup_discard_write_zeroes(struct request *req, bool unmap)
 {
-	unsigned short segments = blk_rq_nr_discard_segments(req);
+	unsigned short segments = 0;
 	unsigned short n = 0;
 	struct virtio_blk_discard_write_zeroes *range;
 	struct bio *bio;
@@ -184,6 +184,9 @@ static int virtblk_setup_discard_write_zeroes(struct request *req, bool unmap)
 	if (unmap)
 		flags |= VIRTIO_BLK_WRITE_ZEROES_FLAG_UNMAP;
 
+	__rq_for_each_bio(bio, req)
+		segments++;
+
 	range = kmalloc_array(segments, sizeof(*range), GFP_ATOMIC);
 	if (!range)
 		return -ENOMEM;

Thanks,
Ming

diff --git a/block/blk-merge.c b/block/blk-merge.c
index 21e87a714a73..af122efeb28d 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -622,6 +622,8 @@ static inline int ll_new_hw_segment(struct request_queue *q,
 				    struct bio *bio)
 {
 	int nr_phys_segs = bio_phys_segments(q, bio);
+	unsigned int seg_size =
+		req->biotail->bi_seg_back_size + bio->bi_seg_front_size;
 
 	if (req->nr_phys_segments + nr_phys_segs > queue_max_segments(q))
 		goto no_merge;
@@ -629,11 +631,15 @@ static inline int ll_new_hw_segment(struct request_queue *q,
 	if (blk_integrity_merge_bio(q, req, bio) == false)
 		goto no_merge;
 
-	/*
-	 * This will form the start of a new hw segment.  Bump both
-	 * counters.
-	 */
-	req->nr_phys_segments += nr_phys_segs;
+	if (blk_phys_contig_segment(q, req->biotail, bio)) {
+		if (nr_phys_segs)
+			req->nr_phys_segments += nr_phys_segs - 1;
+		if (nr_phys_segs == 1)
+			bio->bi_seg_back_size = seg_size;
+		if (req->nr_phys_segments == 1)
+			req->bio->bi_seg_front_size = seg_size;
+	} else
+		req->nr_phys_segments += nr_phys_segs;
 
 	return 1;
 no_merge:
On Tue, May 14, 2019 at 05:05:45PM +0800, Ming Lei wrote:
> However we still may make it better, for example, the attached patch can
> save 10~20% time when running 'mkfs.ntfs -s 512 /dev/vda', lots of small
> request(63KB) can be merged to big IO(1MB).

And we could save even more time by making the block dev buffered I/O
path not do stupid things to start with.

> > With the gap devices we have unlimited segment size, see my next patch
> > to actually enforce that.  Which is much more efficient than using
>
> But this patch does effect on non-gap device, and actually most of
> device are non-gap type.

Yes, but only for request merges, and only if merging the requests
goes over max_segments.  The upside is that we actually get a
nr_phys_segments that mirrors what is in the request, fixing bugs in a
few drivers, and allowing for follow-on patches that significantly
simplify our I/O path.
On 5/14/19 3:51 PM, Christoph Hellwig wrote:
> On Tue, May 14, 2019 at 05:05:45PM +0800, Ming Lei wrote:
>> However we still may make it better, for example, the attached patch can
>> save 10~20% time when running 'mkfs.ntfs -s 512 /dev/vda', lots of small
>> request(63KB) can be merged to big IO(1MB).
>
> And we could save even more time by making the block dev buffered I/O
> path not do stupid things to start with.
>
>>> With the gap devices we have unlimited segment size, see my next patch
>>> to actually enforce that. Which is much more efficient than using
>>
>> But this patch does effect on non-gap device, and actually most of
>> device are non-gap type.
>
> Yes, but only for request merges, and only if merging the requests
> goes over max_requests.  The upside is that we actually get a
> nr_phys_segments that mirrors what is in the request, fixing bugs
> in a few drivers, and allowing for follow on patches that significantly
> simplify our I/O path.

And we've seen several of these crashes in real life; the signature is
'Kernel oops at drivers/scsi/scsi_lib.c:1003'.  So I'm really glad
that this one is finally being addressed.  Thanks Christoph!

Cheers,
Hannes
On Tue, May 14, 2019 at 03:51:42PM +0200, Christoph Hellwig wrote:
> On Tue, May 14, 2019 at 05:05:45PM +0800, Ming Lei wrote:
> > However we still may make it better, for example, the attached patch can
> > save 10~20% time when running 'mkfs.ntfs -s 512 /dev/vda', lots of small
> > request(63KB) can be merged to big IO(1MB).
>
> And we could save even more time by making the block dev buffered I/O
> path not do stupid things to start with.

I am wondering if that can be done easily, given mkfs is userspace
code that only calls the write syscall on the block device.  Or could
you share something about how to fix the stupid things?

> > > With the gap devices we have unlimited segment size, see my next patch
> > > to actually enforce that.  Which is much more efficient than using
> >
> > But this patch does effect on non-gap device, and actually most of
> > device are non-gap type.
>
> Yes, but only for request merges, and only if merging the requests
> goes over max_requests.  The upside is that we actually get a
> nr_phys_segments that mirrors what is in the request, fixing bugs
> in a few drivers, and allowing for follow on patches that significantly
> simplify our I/O path.

A non-gap device still has a max segment size limit, and I guess it
still needs to be respected.

Thanks,
Ming
On Tue, May 14, 2019 at 10:27:17PM +0800, Ming Lei wrote:
> I am wondering if it can be done easily, given mkfs is userspace
> which only calls write syscall on block device. Or could you share
> something about how to fix the stupid things?

mkfs.ext4 at least uses buffered I/O on the block device.  And the
block device uses the really old buffer head based address_space ops,
which will submit one bio per buffer_head, that is, per logical block.
mkfs probably writes in much larger sizes than that.

> > nr_phys_segments that mirrors what is in the request, fixing bugs
> > in a few drivers, and allowing for follow on patches that significantly
> > simplify our I/O path.
>
> non-gap device still has max segment size limit, and I guess it still
> needs to be respected.

It is still respected.  It might just disallow some merges we could
theoretically allow.
On Tue, May 14, 2019 at 04:31:02PM +0200, Christoph Hellwig wrote:
> On Tue, May 14, 2019 at 10:27:17PM +0800, Ming Lei wrote:
> > I am wondering if it can be done easily, given mkfs is userspace
> > which only calls write syscall on block device. Or could you share
> > something about how to fix the stupid things?
>
> mkfs.ext4 at least uses buffered I/O on the block device.  And the
> block device uses the really old buffer head based address_space ops,
> which will submit one bio per buffer_head, that is per logic block.
> mkfs probably writes much larger sizes than that..

As a first step we could try something like the patch below, although
the mpage ops still aren't exactly optimal:

diff --git a/fs/block_dev.c b/fs/block_dev.c
index bded2ee3788d..b2ee74f1c669 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -608,12 +608,12 @@ EXPORT_SYMBOL(thaw_bdev);
 
 static int blkdev_writepage(struct page *page, struct writeback_control *wbc)
 {
-	return block_write_full_page(page, blkdev_get_block, wbc);
+	return mpage_writepage(page, blkdev_get_block, wbc);
 }
 
 static int blkdev_readpage(struct file * file, struct page * page)
 {
-	return block_read_full_page(page, blkdev_get_block);
+	return mpage_readpage(page, blkdev_get_block);
 }
 
 static int blkdev_readpages(struct file *file, struct address_space *mapping,
@@ -1984,7 +1984,7 @@ static int blkdev_releasepage(struct page *page, gfp_t wait)
 static int blkdev_writepages(struct address_space *mapping,
 			     struct writeback_control *wbc)
 {
-	return generic_writepages(mapping, wbc);
+	return mpage_writepages(mapping, wbc, blkdev_get_block);
 }
 
 static const struct address_space_operations def_blk_aops = {
diff --git a/block/blk-merge.c b/block/blk-merge.c
index 21e87a714a73..80a5a0facb87 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -358,7 +358,6 @@ static unsigned int __blk_recalc_rq_segments(struct request_queue *q,
 	unsigned front_seg_size;
 	struct bio *fbio, *bbio;
 	struct bvec_iter iter;
-	bool new_bio = false;
 
 	if (!bio)
 		return 0;
@@ -379,31 +378,12 @@ static unsigned int __blk_recalc_rq_segments(struct request_queue *q,
 	nr_phys_segs = 0;
 	for_each_bio(bio) {
 		bio_for_each_bvec(bv, bio, iter) {
-			if (new_bio) {
-				if (seg_size + bv.bv_len
-				    > queue_max_segment_size(q))
-					goto new_segment;
-				if (!biovec_phys_mergeable(q, &bvprv, &bv))
-					goto new_segment;
-
-				seg_size += bv.bv_len;
-
-				if (nr_phys_segs == 1 && seg_size >
-						front_seg_size)
-					front_seg_size = seg_size;
-
-				continue;
-			}
-new_segment:
 			bvec_split_segs(q, &bv, &nr_phys_segs, &seg_size,
 					&front_seg_size, NULL, UINT_MAX);
-			new_bio = false;
 		}
 		bbio = bio;
-		if (likely(bio->bi_iter.bi_size)) {
+		if (likely(bio->bi_iter.bi_size))
 			bvprv = bv;
-			new_bio = true;
-		}
 	}
 
 	fbio->bi_seg_front_size = front_seg_size;
@@ -725,7 +705,6 @@ static int ll_merge_requests_fn(struct request_queue *q, struct request *req,
 			req->bio->bi_seg_front_size = seg_size;
 		if (next->nr_phys_segments == 1)
 			next->biotail->bi_seg_back_size = seg_size;
-		total_phys_segments--;
 	}
 
 	if (total_phys_segments > queue_max_segments(q))
Currently ll_merge_requests_fn, unlike all other merge functions,
reduces nr_phys_segments by one if the last segment of the previous
and the first segment of the next request are contiguous.  While this
seems like a nice solution to avoid building smaller than possible
requests, it causes a mismatch between the segments actually present
in the request and those iterated over by the bvec iterators,
including __rq_for_each_bio.  This can cause overwrites of too-small
kmalloc allocations in any driver using ranged discards, and can also
mistrigger the single segment optimization in the nvme-pci driver.

We could possibly work around this by making the bvec iterators take
the front and back segment size into account, but that would require
moving them from the bio to the bio_iter and spreading this mess over
all users of bvecs.  Or we could simply remove this optimization under
the assumption that most users already build good enough bvecs, and
that the bio merge path never cared about this optimization either.
The latter is what this patch does.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 block/blk-merge.c | 23 +----------------------
 1 file changed, 1 insertion(+), 22 deletions(-)