Message ID | 20191104180543.23123-1-hch@lst.de (mailing list archive)
---|---
State | New, archived
Series | block: avoid blk_bio_segment_split for small I/O operations
On 11/4/19 11:05 AM, Christoph Hellwig wrote:
> __blk_queue_split() adds significant overhead for small I/O operations.
> Add a shortcut to avoid it for cases where we know we never need to
> split.
>
> Based on a patch from Ming Lei.

Applied, thanks.
On Mon, Nov 04, 2019 at 10:05:43AM -0800, Christoph Hellwig wrote:
> diff --git a/block/blk-merge.c b/block/blk-merge.c
> index 48e6725b32ee..06eb38357b41 100644
> --- a/block/blk-merge.c
> +++ b/block/blk-merge.c
> @@ -293,7 +293,7 @@ static struct bio *blk_bio_segment_split(struct request_queue *q,
>  void __blk_queue_split(struct request_queue *q, struct bio **bio,
>  		unsigned int *nr_segs)
>  {
> -	struct bio *split;
> +	struct bio *split = NULL;
>
>  	switch (bio_op(*bio)) {
>  	case REQ_OP_DISCARD:
> @@ -309,6 +309,19 @@ void __blk_queue_split(struct request_queue *q, struct bio **bio,
>  				nr_segs);
>  		break;
>  	default:
> +		/*
> +		 * All drivers must accept single-segments bios that are <=
> +		 * PAGE_SIZE. This is a quick and dirty check that relies on
> +		 * the fact that bi_io_vec[0] is always valid if a bio has data.
> +		 * The check might lead to occasional false negatives when bios
> +		 * are cloned, but compared to the performance impact of cloned
> +		 * bios themselves the loop below doesn't matter anyway.
> +		 */
> +		if ((*bio)->bi_vcnt == 1 &&
> +		    (*bio)->bi_io_vec[0].bv_len <= PAGE_SIZE) {
> +			*nr_segs = 1;
> +			break;
> +		}
>  		split = blk_bio_segment_split(q, *bio, &q->bio_split, nr_segs);
>  		break;
>  	}

If the device advertises a chunk boundary and this small IO happens to
cross it, skipping the split is going to harm performance.
On 11/4/19 12:30 PM, Keith Busch wrote:
> On Mon, Nov 04, 2019 at 10:05:43AM -0800, Christoph Hellwig wrote:
>> diff --git a/block/blk-merge.c b/block/blk-merge.c
>> index 48e6725b32ee..06eb38357b41 100644
>> --- a/block/blk-merge.c
>> +++ b/block/blk-merge.c
>> @@ -293,7 +293,7 @@ static struct bio *blk_bio_segment_split(struct request_queue *q,
>>  void __blk_queue_split(struct request_queue *q, struct bio **bio,
>>  		unsigned int *nr_segs)
>>  {
>> -	struct bio *split;
>> +	struct bio *split = NULL;
>>
>>  	switch (bio_op(*bio)) {
>>  	case REQ_OP_DISCARD:
>> @@ -309,6 +309,19 @@ void __blk_queue_split(struct request_queue *q, struct bio **bio,
>>  				nr_segs);
>>  		break;
>>  	default:
>> +		/*
>> +		 * All drivers must accept single-segments bios that are <=
>> +		 * PAGE_SIZE. This is a quick and dirty check that relies on
>> +		 * the fact that bi_io_vec[0] is always valid if a bio has data.
>> +		 * The check might lead to occasional false negatives when bios
>> +		 * are cloned, but compared to the performance impact of cloned
>> +		 * bios themselves the loop below doesn't matter anyway.
>> +		 */
>> +		if ((*bio)->bi_vcnt == 1 &&
>> +		    (*bio)->bi_io_vec[0].bv_len <= PAGE_SIZE) {
>> +			*nr_segs = 1;
>> +			break;
>> +		}
>>  		split = blk_bio_segment_split(q, *bio, &q->bio_split, nr_segs);
>>  		break;
>>  	}
>
> If the device advertises a chunk boundary and this small IO happens to
> cross it, skipping the split is going to harm performance.

Does anyone do that, that isn't the first gen intel weirdness? Honest
question, but always seemed to me that this spec addition was driven
entirely by that one device.

And if they do, do they align on non-4k?
On Mon, Nov 04, 2019 at 01:13:53PM -0700, Jens Axboe wrote:
> > If the device advertises a chunk boundary and this small IO happens to
> > cross it, skipping the split is going to harm performance.
>
> Does anyone do that, that isn't the first gen intel weirdness? Honest
> question, but always seemed to me that this spec addition was driven
> entirely by that one device.

There are at least 3 generations of Intel DC P-series that use this,
maybe more. I'm not sure if any other available vendor devices report
this feature, though.

> And if they do, do they align on non-4k?

All existing ones I'm aware of are 128k, so 4k aligned, but if the LBA
format is 512B, you could start a 4k IO at a 126k offset to straddle the
boundary. Hm, maybe we don't care about the split penalty in that case
since unaligned access is already going to be slower for other reasons ...
On 11/4/19 3:50 PM, Keith Busch wrote:
> On Mon, Nov 04, 2019 at 01:13:53PM -0700, Jens Axboe wrote:
>>> If the device advertises a chunk boundary and this small IO happens to
>>> cross it, skipping the split is going to harm performance.
>>
>> Does anyone do that, that isn't the first gen intel weirdness? Honest
>> question, but always seemed to me that this spec addition was driven
>> entirely by that one device.
>
> There are at least 3 generations of Intel DC P-series that use this,
> maybe more. I'm not sure if any other available vendor devices report
> this feature, though.

Gotcha

>> And if they do, do they align on non-4k?
>
> All existing ones I'm aware of are 128k, so 4k aligned, but if the LBA
> format is 512B, you could start a 4k IO at a 126k offset to straddle the
> boundary. Hm, maybe we don't care about the split penalty in that case
> since unaligned access is already going to be slower for other reasons ...

Yeah, not sure that's a huge concern for that particular case. If you
think it's a real world issue, it should be possible to set aside a
queue bit for this and always have them hit the slower split path.
On Mon, Nov 04, 2019 at 03:55:41PM -0700, Jens Axboe wrote:
> > All existing ones I'm aware of are 128k, so 4k aligned, but if the LBA
> > format is 512B, you could start a 4k IO at a 126k offset to straddle the
> > boundary. Hm, maybe we don't care about the split penalty in that case
> > since unaligned access is already going to be slower for other reasons ...
>
> Yeah, not sure that's a huge concern for that particular case. If you
> think it's a real world issue, it should be possible to set aside a
> queue bit for this and always have them hit the slower split path.

Well, we use that field for zoned devices also, in which case it is an
issue. I think I'll need to send a patch to skip the fast path if we
have chunk_sectors set.
On 11/4/19 2:50 PM, Keith Busch wrote:
> On Mon, Nov 04, 2019 at 01:13:53PM -0700, Jens Axboe wrote:
>>> If the device advertises a chunk boundary and this small IO happens to
>>> cross it, skipping the split is going to harm performance.
>>
>> Does anyone do that, that isn't the first gen intel weirdness? Honest
>> question, but always seemed to me that this spec addition was driven
>> entirely by that one device.
>
> There are at least 3 generations of Intel DC P-series that use this,
> maybe more. I'm not sure if any other available vendor devices report
> this feature, though.
>
>> And if they do, do they align on non-4k?
>
> All existing ones I'm aware of are 128k, so 4k aligned, but if the LBA
> format is 512B, you could start a 4k IO at a 126k offset to straddle the
> boundary. Hm, maybe we don't care about the split penalty in that case
> since unaligned access is already going to be slower for other reasons ...

Aren't NVMe devices expected to set the NOIOB parameter to avoid that
NVMe commands straddle boundaries that incur a performance penalty? From
the NVMe spec: "Namespace Optimal IO Boundary (NOIOB): This field
indicates the optimal IO boundary for this namespace. This field is
specified in logical blocks. The host should construct read and write
commands that do not cross the IO boundary to achieve optimal
performance. A value of 0h indicates that no optimal IO boundary is
reported."

Thanks,

Bart.
On 11/4/19 3:57 PM, Christoph Hellwig wrote:
> On Mon, Nov 04, 2019 at 03:55:41PM -0700, Jens Axboe wrote:
>>> All existing ones I'm aware of are 128k, so 4k aligned, but if the LBA
>>> format is 512B, you could start a 4k IO at a 126k offset to straddle the
>>> boundary. Hm, maybe we don't care about the split penalty in that case
>>> since unaligned access is already going to be slower for other reasons ...
>>
>> Yeah, not sure that's a huge concern for that particular case. If you
>> think it's a real world issue, it should be possible to set aside a
>> queue bit for this and always have them hit the slower split path.
>
> Well, we use that field for zoned devices also, in which case it is an
> issue. I think I'll need to send a patch to skip the fast path if
> we have chunk_sectors set.

Yes, let's do that, makes sense.
On Mon, Nov 04, 2019 at 02:58:41PM -0800, Bart Van Assche wrote:
> On 11/4/19 2:50 PM, Keith Busch wrote:
> > On Mon, Nov 04, 2019 at 01:13:53PM -0700, Jens Axboe wrote:
> > > > If the device advertises a chunk boundary and this small IO happens to
> > > > cross it, skipping the split is going to harm performance.
> > >
> > > Does anyone do that, that isn't the first gen intel weirdness? Honest
> > > question, but always seemed to me that this spec addition was driven
> > > entirely by that one device.
> >
> > There are at least 3 generations of Intel DC P-series that use this,
> > maybe more. I'm not sure if any other available vendor devices report
> > this feature, though.
> >
> > > And if they do, do they align on non-4k?
> >
> > All existing ones I'm aware of are 128k, so 4k aligned, but if the LBA
> > format is 512B, you could start a 4k IO at a 126k offset to straddle the
> > boundary. Hm, maybe we don't care about the split penalty in that case
> > since unaligned access is already going to be slower for other reasons ...
>
> Aren't NVMe devices expected to set the NOIOB parameter to avoid that NVMe
> commands straddle boundaries that incur a performance penalty? From the NVMe
> spec: "Namespace Optimal IO Boundary (NOIOB): This field indicates the
> optimal IO boundary for this namespace. This field is specified in logical
> blocks. The host should construct read and write commands that do not cross
> the IO boundary to achieve optimal performance. A value of 0h indicates that
> no optimal IO boundary is reported."

Yes, for nvme, noiob is the feature we're talking about. I was initially
just thinking about performance, but there's other cases Christoph
mentioned where the host split is necessary.
diff --git a/block/blk-merge.c b/block/blk-merge.c
index 48e6725b32ee..06eb38357b41 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -293,7 +293,7 @@ static struct bio *blk_bio_segment_split(struct request_queue *q,
 void __blk_queue_split(struct request_queue *q, struct bio **bio,
 		unsigned int *nr_segs)
 {
-	struct bio *split;
+	struct bio *split = NULL;
 
 	switch (bio_op(*bio)) {
 	case REQ_OP_DISCARD:
@@ -309,6 +309,19 @@ void __blk_queue_split(struct request_queue *q, struct bio **bio,
 				nr_segs);
 		break;
 	default:
+		/*
+		 * All drivers must accept single-segments bios that are <=
+		 * PAGE_SIZE. This is a quick and dirty check that relies on
+		 * the fact that bi_io_vec[0] is always valid if a bio has data.
+		 * The check might lead to occasional false negatives when bios
+		 * are cloned, but compared to the performance impact of cloned
+		 * bios themselves the loop below doesn't matter anyway.
+		 */
+		if ((*bio)->bi_vcnt == 1 &&
+		    (*bio)->bi_io_vec[0].bv_len <= PAGE_SIZE) {
+			*nr_segs = 1;
+			break;
+		}
 		split = blk_bio_segment_split(q, *bio, &q->bio_split, nr_segs);
 		break;
 	}
__blk_queue_split() adds significant overhead for small I/O operations.
Add a shortcut to avoid it for cases where we know we never need to
split.

Based on a patch from Ming Lei.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 block/blk-merge.c | 15 ++++++++++++++-
 1 file changed, 14 insertions(+), 1 deletion(-)