diff mbox series

[1/4] block: don't decrement nr_phys_segments for physically contigous segments

Message ID 20190516084058.20678-2-hch@lst.de (mailing list archive)
State New, archived
Headers show
Series [1/4] block: don't decrement nr_phys_segments for physically contigous segments | expand

Commit Message

Christoph Hellwig May 16, 2019, 8:40 a.m. UTC
Currently ll_merge_requests_fn, unlike all other merge functions,
reduces nr_phys_segments by one if the last segment of the previous,
and the first segment of the next segement are contigous.  While this
seems like a nice solution to avoid building smaller than possible
requests it causes a mismatch between the segments actually present
in the request and those iterated over by the bvec iterators, including
__rq_for_each_bio.  This could cause overwrites of too small kmalloc
allocations in any driver using ranged discard, or also mistrigger
the single segment optimization in the nvme-pci driver.

We could possibly work around this by making the bvec iterators take
the front and back segment size into account, but that would require
moving them from the bio to the bio_iter and spreading this mess
over all users of bvecs.  Or we could simply remove this optimization
under the assumption that most users already build good enough bvecs,
and that the bio merge patch never cared about this optimization
either.  The latter is what this patch does.

Fixes: b35ba01ea697 ("nvme: support ranged discard requests")
Fixes: 1f23816b8eb8 ("virtio_blk: add discard and write zeroes support")
Fixes: 297910571f08 ("nvme-pci: optimize mapping single segment requests using SGLs")
Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 block/blk-merge.c | 23 +----------------------
 1 file changed, 1 insertion(+), 22 deletions(-)

Comments

Hannes Reinecke May 16, 2019, 8:48 a.m. UTC | #1
On 5/16/19 10:40 AM, Christoph Hellwig wrote:
> Currently ll_merge_requests_fn, unlike all other merge functions,
> reduces nr_phys_segments by one if the last segment of the previous,
> and the first segment of the next segement are contigous.  While this
> seems like a nice solution to avoid building smaller than possible
> requests it causes a mismatch between the segments actually present
> in the request and those iterated over by the bvec iterators, including
> __rq_for_each_bio.  This could cause overwrites of too small kmalloc
> allocations in any driver using ranged discard, or also mistrigger
> the single segment optimization in the nvme-pci driver.
> 
> We could possibly work around this by making the bvec iterators take
> the front and back segment size into account, but that would require
> moving them from the bio to the bio_iter and spreading this mess
> over all users of bvecs.  Or we could simply remove this optimization
> under the assumption that most users already build good enough bvecs,
> and that the bio merge patch never cared about this optimization
> either.  The latter is what this patch does.
> 
> Fixes: b35ba01ea697 ("nvme: support ranged discard requests")
> Fixes: 1f23816b8eb8 ("virtio_blk: add discard and write zeroes support")
> Fixes: 297910571f08 ("nvme-pci: optimize mapping single segment requests using SGLs")
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
>   block/blk-merge.c | 23 +----------------------
>   1 file changed, 1 insertion(+), 22 deletions(-)
> 
Reviewed-by: Hannes Reinecke <hare@suse.com>

Cheers,

Hannes
Ming Lei May 16, 2019, 1:17 p.m. UTC | #2
Hi Christoph,

On Thu, May 16, 2019 at 10:40:55AM +0200, Christoph Hellwig wrote:
> Currently ll_merge_requests_fn, unlike all other merge functions,
> reduces nr_phys_segments by one if the last segment of the previous,
> and the first segment of the next segement are contigous.  While this
> seems like a nice solution to avoid building smaller than possible
> requests it causes a mismatch between the segments actually present
> in the request and those iterated over by the bvec iterators, including
> __rq_for_each_bio.  This could cause overwrites of too small kmalloc
> allocations in any driver using ranged discard, or also mistrigger
> the single segment optimization in the nvme-pci driver.
> 
> We could possibly work around this by making the bvec iterators take
> the front and back segment size into account, but that would require
> moving them from the bio to the bio_iter and spreading this mess
> over all users of bvecs.  Or we could simply remove this optimization
> under the assumption that most users already build good enough bvecs,
> and that the bio merge patch never cared about this optimization
> either.  The latter is what this patch does.
> 
> Fixes: b35ba01ea697 ("nvme: support ranged discard requests")
> Fixes: 1f23816b8eb8 ("virtio_blk: add discard and write zeroes support")

ll_merge_requests_fn() is only called from attempt_merge() in case
that ELEVATOR_BACK_MERGE is returned from blk_try_req_merge(). However,
for discard merge of both virtio_blk and nvme, ELEVATOR_DISCARD_MERGE is
always returned from blk_try_req_merge() in attempt_merge(), so looks
ll_merge_requests_fn() shouldn't be called for virtio_blk/nvme's discard
request. Just wondering if you may explain a bit how the change on
ll_merge_requests_fn() in this patch makes a difference on the above
two commits?

> Fixes: 297910571f08 ("nvme-pci: optimize mapping single segment requests using SGLs")

I guess it should be dff824b2aadb ("nvme-pci: optimize mapping of small
single segment requests").

Yes, this patch helps for this case, cause blk_rq_nr_phys_segments() may be 1
but there are two bios which share same segment.

Thanks,
Ming
Ming Lei May 17, 2019, 11:02 p.m. UTC | #3
On Thu, May 16, 2019 at 09:17:04PM +0800, Ming Lei wrote:
> Hi Christoph,
> 
> On Thu, May 16, 2019 at 10:40:55AM +0200, Christoph Hellwig wrote:
> > Currently ll_merge_requests_fn, unlike all other merge functions,
> > reduces nr_phys_segments by one if the last segment of the previous,
> > and the first segment of the next segement are contigous.  While this
> > seems like a nice solution to avoid building smaller than possible
> > requests it causes a mismatch between the segments actually present
> > in the request and those iterated over by the bvec iterators, including
> > __rq_for_each_bio.  This could cause overwrites of too small kmalloc
> > allocations in any driver using ranged discard, or also mistrigger
> > the single segment optimization in the nvme-pci driver.
> > 
> > We could possibly work around this by making the bvec iterators take
> > the front and back segment size into account, but that would require
> > moving them from the bio to the bio_iter and spreading this mess
> > over all users of bvecs.  Or we could simply remove this optimization
> > under the assumption that most users already build good enough bvecs,
> > and that the bio merge patch never cared about this optimization
> > either.  The latter is what this patch does.
> > 
> > Fixes: b35ba01ea697 ("nvme: support ranged discard requests")
> > Fixes: 1f23816b8eb8 ("virtio_blk: add discard and write zeroes support")
> 
> ll_merge_requests_fn() is only called from attempt_merge() in case
> that ELEVATOR_BACK_MERGE is returned from blk_try_req_merge(). However,
> for discard merge of both virtio_blk and nvme, ELEVATOR_DISCARD_MERGE is
> always returned from blk_try_req_merge() in attempt_merge(), so looks
> ll_merge_requests_fn() shouldn't be called for virtio_blk/nvme's discard
> request. Just wondering if you may explain a bit how the change on
> ll_merge_requests_fn() in this patch makes a difference on the above
> two commits?
> 
> > Fixes: 297910571f08 ("nvme-pci: optimize mapping single segment requests using SGLs")
> 
> I guess it should be dff824b2aadb ("nvme-pci: optimize mapping of small
> single segment requests").
> 
> Yes, this patch helps for this case, cause blk_rq_nr_phys_segments() may be 1
> but there are two bios which share same segment.

BTW, I just sent a single-line nvme-pci fix on this issue, which may be more
suitable to serve as v5.2 fix:

http://lists.infradead.org/pipermail/linux-nvme/2019-May/024283.html

Thanks,
Ming
Christoph Hellwig May 20, 2019, 11:11 a.m. UTC | #4
On Thu, May 16, 2019 at 09:17:04PM +0800, Ming Lei wrote:
> ll_merge_requests_fn() is only called from attempt_merge() in case
> that ELEVATOR_BACK_MERGE is returned from blk_try_req_merge(). However,
> for discard merge of both virtio_blk and nvme, ELEVATOR_DISCARD_MERGE is
> always returned from blk_try_req_merge() in attempt_merge(), so looks
> ll_merge_requests_fn() shouldn't be called for virtio_blk/nvme's discard
> request. Just wondering if you may explain a bit how the change on
> ll_merge_requests_fn() in this patch makes a difference on the above
> two commits?

Good question.  I've seen virtio overwriting its range, but I think
this might have been been with a series to actually decrement
nr_phys_segments for all cases where we can merge the tail and front
bvecs.  So mainline probably doesn't see it unless someone calls
blk_recalc_rq_segments due to a partial completion or when using
dm-multipath.  Thinking of it at least the latter seems like a real
possibily, although a rather unlikely use case.

> > Fixes: 297910571f08 ("nvme-pci: optimize mapping single segment requests using SGLs")
> 
> I guess it should be dff824b2aadb ("nvme-pci: optimize mapping of small
> single segment requests").

Indeed.
Ming Lei May 21, 2019, 1:04 a.m. UTC | #5
On Mon, May 20, 2019 at 01:11:41PM +0200, Christoph Hellwig wrote:
> On Thu, May 16, 2019 at 09:17:04PM +0800, Ming Lei wrote:
> > ll_merge_requests_fn() is only called from attempt_merge() in case
> > that ELEVATOR_BACK_MERGE is returned from blk_try_req_merge(). However,
> > for discard merge of both virtio_blk and nvme, ELEVATOR_DISCARD_MERGE is
> > always returned from blk_try_req_merge() in attempt_merge(), so looks
> > ll_merge_requests_fn() shouldn't be called for virtio_blk/nvme's discard
> > request. Just wondering if you may explain a bit how the change on
> > ll_merge_requests_fn() in this patch makes a difference on the above
> > two commits?
> 
> Good question.  I've seen virtio overwriting its range, but I think
> this might have been been with a series to actually decrement
> nr_phys_segments for all cases where we can merge the tail and front
> bvecs.  So mainline probably doesn't see it unless someone calls
> blk_recalc_rq_segments due to a partial completion or when using
> dm-multipath.  Thinking of it at least the latter seems like a real
> possibily, although a rather unlikely use case.

This patch shouldn't effect discard IO in case of partial completion too
cause blk_recalc_rq_segments() always return 0 for discard IO w/wo this
patch.

However looks this way is wrong, the following patch may help for this
case:

diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 1aafeb923e7b..302667887ea1 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1109,7 +1109,12 @@ static inline unsigned short blk_rq_nr_phys_segments(struct request *rq)
  */
 static inline unsigned short blk_rq_nr_discard_segments(struct request *rq)
 {
-	return max_t(unsigned short, rq->nr_phys_segments, 1);
+	struct bio *bio;
+	unsigned shart segs = 0;
+
+	__rq_for_each_bio(bio, rq)
+		segs++;
+	return segs;
 }
 
 extern int blk_rq_map_sg(struct request_queue *, struct request *, struct scatterlist *);

Or re-calculate the segment number in this way for multi-range discard IO in
__blk_recalc_rq_segments().

Thanks,
Ming
diff mbox series

Patch

diff --git a/block/blk-merge.c b/block/blk-merge.c
index 21e87a714a73..80a5a0facb87 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -358,7 +358,6 @@  static unsigned int __blk_recalc_rq_segments(struct request_queue *q,
 	unsigned front_seg_size;
 	struct bio *fbio, *bbio;
 	struct bvec_iter iter;
-	bool new_bio = false;
 
 	if (!bio)
 		return 0;
@@ -379,31 +378,12 @@  static unsigned int __blk_recalc_rq_segments(struct request_queue *q,
 	nr_phys_segs = 0;
 	for_each_bio(bio) {
 		bio_for_each_bvec(bv, bio, iter) {
-			if (new_bio) {
-				if (seg_size + bv.bv_len
-				    > queue_max_segment_size(q))
-					goto new_segment;
-				if (!biovec_phys_mergeable(q, &bvprv, &bv))
-					goto new_segment;
-
-				seg_size += bv.bv_len;
-
-				if (nr_phys_segs == 1 && seg_size >
-						front_seg_size)
-					front_seg_size = seg_size;
-
-				continue;
-			}
-new_segment:
 			bvec_split_segs(q, &bv, &nr_phys_segs, &seg_size,
 					&front_seg_size, NULL, UINT_MAX);
-			new_bio = false;
 		}
 		bbio = bio;
-		if (likely(bio->bi_iter.bi_size)) {
+		if (likely(bio->bi_iter.bi_size))
 			bvprv = bv;
-			new_bio = true;
-		}
 	}
 
 	fbio->bi_seg_front_size = front_seg_size;
@@ -725,7 +705,6 @@  static int ll_merge_requests_fn(struct request_queue *q, struct request *req,
 			req->bio->bi_seg_front_size = seg_size;
 		if (next->nr_phys_segments == 1)
 			next->biotail->bi_seg_back_size = seg_size;
-		total_phys_segments--;
 	}
 
 	if (total_phys_segments > queue_max_segments(q))