block: avoid blk_bio_segment_split for small I/O operations

Message ID 20191104180543.23123-1-hch@lst.de (mailing list archive)
State New, archived
Series: block: avoid blk_bio_segment_split for small I/O operations

Commit Message

Christoph Hellwig Nov. 4, 2019, 6:05 p.m. UTC
__blk_queue_split() adds significant overhead for small I/O operations.
Add a shortcut to avoid it for cases where we know we never need to
split.

Based on a patch from Ming Lei.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 block/blk-merge.c | 15 ++++++++++++++-
 1 file changed, 14 insertions(+), 1 deletion(-)

Comments

Jens Axboe Nov. 4, 2019, 6:24 p.m. UTC | #1
On 11/4/19 11:05 AM, Christoph Hellwig wrote:
> __blk_queue_split() adds significant overhead for small I/O operations.
> Add a shortcut to avoid it for cases where we know we never need to
> split.
> 
> Based on a patch from Ming Lei.

Applied, thanks.
Keith Busch Nov. 4, 2019, 7:30 p.m. UTC | #2
On Mon, Nov 04, 2019 at 10:05:43AM -0800, Christoph Hellwig wrote:
> diff --git a/block/blk-merge.c b/block/blk-merge.c
> index 48e6725b32ee..06eb38357b41 100644
> --- a/block/blk-merge.c
> +++ b/block/blk-merge.c
> @@ -293,7 +293,7 @@ static struct bio *blk_bio_segment_split(struct request_queue *q,
>  void __blk_queue_split(struct request_queue *q, struct bio **bio,
>  		unsigned int *nr_segs)
>  {
> -	struct bio *split;
> +	struct bio *split = NULL;
>  
>  	switch (bio_op(*bio)) {
>  	case REQ_OP_DISCARD:
> @@ -309,6 +309,19 @@ void __blk_queue_split(struct request_queue *q, struct bio **bio,
>  				nr_segs);
>  		break;
>  	default:
> +		/*
> +		 * All drivers must accept single-segment bios that are <=
> +		 * PAGE_SIZE.  This is a quick and dirty check that relies on
> +		 * the fact that bi_io_vec[0] is always valid if a bio has data.
> +		 * The check might lead to occasional false negatives when bios
> +		 * are cloned, but compared to the performance impact of cloned
> +		 * bios themselves the loop below doesn't matter anyway.
> +		 */
> +		if ((*bio)->bi_vcnt == 1 &&
> +		    (*bio)->bi_io_vec[0].bv_len <= PAGE_SIZE) {
> +			*nr_segs = 1;
> +			break;
> +		}
>  		split = blk_bio_segment_split(q, *bio, &q->bio_split, nr_segs);
>  		break;
>  	}

If the device advertises a chunk boundary and this small IO happens to
cross it, skipping the split is going to harm performance.
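
A minimal sketch (not from the thread) of the case Keith describes; the
helper is hypothetical, and chunk_sectors is assumed to be a power of
two, as it is for NVMe noiob:

    /* Does [start, start + nr_sectors) cross a chunk_sectors boundary? */
    static bool sketch_bio_crosses_chunk(sector_t start,
                                         unsigned int nr_sectors,
                                         unsigned int chunk_sectors)
    {
            /* offset of the I/O within its chunk, in 512B sectors */
            unsigned int offset = start & (chunk_sectors - 1);

            return offset + nr_sectors > chunk_sectors;
    }

The fast path in the patch never evaluates such a condition, so a bio
that passes the single-segment check can still straddle a chunk
boundary.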
Jens Axboe Nov. 4, 2019, 8:13 p.m. UTC | #3
On 11/4/19 12:30 PM, Keith Busch wrote:
> On Mon, Nov 04, 2019 at 10:05:43AM -0800, Christoph Hellwig wrote:
>> diff --git a/block/blk-merge.c b/block/blk-merge.c
>> index 48e6725b32ee..06eb38357b41 100644
>> --- a/block/blk-merge.c
>> +++ b/block/blk-merge.c
>> @@ -293,7 +293,7 @@ static struct bio *blk_bio_segment_split(struct request_queue *q,
>>   void __blk_queue_split(struct request_queue *q, struct bio **bio,
>>   		unsigned int *nr_segs)
>>   {
>> -	struct bio *split;
>> +	struct bio *split = NULL;
>>   
>>   	switch (bio_op(*bio)) {
>>   	case REQ_OP_DISCARD:
>> @@ -309,6 +309,19 @@ void __blk_queue_split(struct request_queue *q, struct bio **bio,
>>   				nr_segs);
>>   		break;
>>   	default:
>> +		/*
>> +		 * All drivers must accept single-segment bios that are <=
>> +		 * PAGE_SIZE.  This is a quick and dirty check that relies on
>> +		 * the fact that bi_io_vec[0] is always valid if a bio has data.
>> +		 * The check might lead to occasional false negatives when bios
>> +		 * are cloned, but compared to the performance impact of cloned
>> +		 * bios themselves the loop below doesn't matter anyway.
>> +		 */
>> +		if ((*bio)->bi_vcnt == 1 &&
>> +		    (*bio)->bi_io_vec[0].bv_len <= PAGE_SIZE) {
>> +			*nr_segs = 1;
>> +			break;
>> +		}
>>   		split = blk_bio_segment_split(q, *bio, &q->bio_split, nr_segs);
>>   		break;
>>   	}
> 
> If the device advertises a chunk boundary and this small IO happens to
> cross it, skipping the split is going to harm performance.

Does anyone do that, that isn't the first gen intel weirdness? Honest question,
but always seemed to me that this spec addition was driven entirely by that
one device.

And if they do, do they align on non-4k?
Keith Busch Nov. 4, 2019, 10:50 p.m. UTC | #4
On Mon, Nov 04, 2019 at 01:13:53PM -0700, Jens Axboe wrote:
> > If the device advertises a chunk boundary and this small IO happens to
> > cross it, skipping the split is going to harm performance.
> 
> Does anyone do that, that isn't the first gen intel weirdness? Honest question,
> but always seemed to me that this spec addition was driven entirely by that
> one device.

There are at least 3 generations of Intel DC P-series that use this,
maybe more. I'm not sure if any other available vendor devices report
this feature, though.
 
> And if they do, do they align on non-4k?

All existing ones I'm aware of are 128k, so 4k aligned, but if the LBA
format is 512B, you could start a 4k IO at a 126k offset to straddle the
boundary. Hm, maybe we don't care about the split penalty in that case
since unaligned access is already going to be slower for other reasons ...
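
Keith's example, worked through in 512B sectors (illustrative
arithmetic, not from the thread):

    sector_t start     = 126 * 1024 / 512;  /* 252: 4k I/O at a 126k offset */
    unsigned int len   = 4 * 1024 / 512;    /* 8 sectors */
    unsigned int chunk = 128 * 1024 / 512;  /* 256: 128k noiob */

    /* (252 & 255) + 8 = 260 > 256, so the I/O straddles the boundary */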
Jens Axboe Nov. 4, 2019, 10:55 p.m. UTC | #5
On 11/4/19 3:50 PM, Keith Busch wrote:
> On Mon, Nov 04, 2019 at 01:13:53PM -0700, Jens Axboe wrote:
>>> If the device advertises a chunk boundary and this small IO happens to
>>> cross it, skipping the split is going to harm performance.
>>
>> Does anyone do that, that isn't the first gen intel weirdness? Honest question,
>> but always seemed to me that this spec addition was driven entirely by that
>> one device.
> 
> There are at least 3 generations of Intel DC P-series that use this,
> maybe more. I'm not sure if any other available vendor devices report
> this feature, though.

Gotcha

>> And if they do, do they align on non-4k?
> 
> All existing ones I'm aware of are 128k, so 4k aligned, but if the LBA
> format is 512B, you could start a 4k IO at a 126k offset to straddle the
> boundary. Hm, maybe we don't care about the split penalty in that case
> since unaligned access is already going to be slower for other reasons ...

Yeah, not sure that's a huge concern for that particular case. If you
think it's a real world issue, it should be possible to set aside a
queue bit for this and always have them hit the slower split path.
Christoph Hellwig Nov. 4, 2019, 10:57 p.m. UTC | #6
On Mon, Nov 04, 2019 at 03:55:41PM -0700, Jens Axboe wrote:
> > All existing ones I'm aware of are 128k, so 4k aligned, but if the LBA
> > format is 512B, you could start a 4k IO at a 126k offset to straddle the
> > boundary. Hm, maybe we don't care about the split penalty in that case
> > since unaligned access is already going to be slower for other reasons ...
> 
> Yeah, not sure that's a huge concern for that particular case. If you
> think it's a real world issue, it should be possible to set aside a
> queue bit for this and always have them hit the slower split path.

Well, we use that field for zoned devices also, in which case it is an
issue.  I think I'll need to send a patch to skip the fast path if
we have chunk_sectors set.
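
A sketch of what that follow-up might look like (the exact condition is
an assumption; the idea is simply to bypass the fast path whenever
chunk_sectors is set):

    default:
            /*
             * Only take the single-segment fast path when no chunk
             * boundary (noiob / zoned) is configured.
             */
            if (!q->limits.chunk_sectors && (*bio)->bi_vcnt == 1 &&
                (*bio)->bi_io_vec[0].bv_len <= PAGE_SIZE) {
                    *nr_segs = 1;
                    break;
            }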
Bart Van Assche Nov. 4, 2019, 10:58 p.m. UTC | #7
On 11/4/19 2:50 PM, Keith Busch wrote:
> On Mon, Nov 04, 2019 at 01:13:53PM -0700, Jens Axboe wrote:
>>> If the device advertises a chunk boundary and this small IO happens to
>>> cross it, skipping the split is going to harm performance.
>>
>> Does anyone do that, that isn't the first gen intel weirdness? Honest question,
>> but always seemed to me that this spec addition was driven entirely by that
>> one device.
> 
> There are at least 3 generations of Intel DC P-series that use this,
> maybe more. I'm not sure if any other available vendor devices report
> this feature, though.
>   
>> And if they do, do they align on non-4k?
> 
> All existing ones I'm aware of are 128k, so 4k aligned, but if the LBA
> format is 512B, you could start a 4k IO at a 126k offset to straddle the
> boundary. Hm, maybe we don't care about the split penalty in that case
> since unaligned access is already going to be slower for other reasons ...

Aren't NVMe devices expected to set the NOIOB parameter so that NVMe
commands don't straddle boundaries that incur a performance penalty? From
the NVMe spec: "Namespace Optimal IO Boundary (NOIOB): This field 
indicates the optimal IO boundary for this namespace. This field is 
specified in logical blocks. The host should construct read and write 
commands that do not cross the IO boundary to achieve optimal 
performance. A value of 0h indicates that no optimal IO boundary is 
reported."

Thanks,

Bart.
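
For reference, the NVMe driver feeds NOIOB into the block layer's
chunk_sectors limit; a sketch modeled on the v5.4-era
nvme_set_chunk_size() (details may differ between kernel versions):

    static void nvme_set_chunk_size(struct nvme_ns *ns)
    {
            /* noiob is in logical blocks; convert to 512B sectors */
            u32 chunk_size = (u32)ns->noiob << (ns->lba_shift - 9);

            blk_queue_chunk_sectors(ns->queue,
                                    rounddown_pow_of_two(chunk_size));
    }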
Jens Axboe Nov. 4, 2019, 10:59 p.m. UTC | #8
On 11/4/19 3:57 PM, Christoph Hellwig wrote:
> On Mon, Nov 04, 2019 at 03:55:41PM -0700, Jens Axboe wrote:
>>> All existing ones I'm aware of are 128k, so 4k aligned, but if the LBA
>>> format is 512B, you could start a 4k IO at a 126k offset to straddle the
>>> boundary. Hm, maybe we don't care about the split penalty in that case
>>> since unaligned access is already going to be slower for other reasons ...
>>
>> Yeah, not sure that's a huge concern for that particular case. If you
>> think it's a real world issue, it should be possible to set aside a
>> queue bit for this and always have them hit the slower split path.
> 
> Well, we use that field for zoned devices also, in which case it is an
> issue.  I think I'll need to send a patch to skip the fast path if
> we have chunk_sectors set.

Yes, let's do that, makes sense.
Keith Busch Nov. 4, 2019, 11:04 p.m. UTC | #9
On Mon, Nov 04, 2019 at 02:58:41PM -0800, Bart Van Assche wrote:
> On 11/4/19 2:50 PM, Keith Busch wrote:
> > On Mon, Nov 04, 2019 at 01:13:53PM -0700, Jens Axboe wrote:
> > > > If the device advertises a chunk boundary and this small IO happens to
> > > > cross it, skipping the split is going to harm performance.
> > > 
> > > Does anyone do that, that isn't the first gen intel weirdness? Honest question,
> > > but always seemed to me that this spec addition was driven entirely by that
> > > one device.
> > 
> > There are at least 3 generations of Intel DC P-series that use this,
> > maybe more. I'm not sure if any other available vendor devices report
> > this feature, though.
> > 
> > > And if they do, do they align on non-4k?
> > 
> > All existing ones I'm aware of are 128k, so 4k aligned, but if the LBA
> > format is 512B, you could start a 4k IO at a 126k offset to straddle the
> > boundary. Hm, maybe we don't care about the split penalty in that case
> > since unaligned access is already going to be slower for other reasons ...
> 
> Aren't NVMe devices expected to set the NOIOB parameter so that NVMe
> commands don't straddle boundaries that incur a performance penalty? From the NVMe
> spec: "Namespace Optimal IO Boundary (NOIOB): This field indicates the
> optimal IO boundary for this namespace. This field is specified in logical
> blocks. The host should construct read and write commands that do not cross
> the IO boundary to achieve optimal performance. A value of 0h indicates that
> no optimal IO boundary is reported."

Yes, for nvme, noiob is the feature we're talking about.

I was initially just thinking about performance, but there are other
cases Christoph mentioned where the host split is necessary.

Patch

diff --git a/block/blk-merge.c b/block/blk-merge.c
index 48e6725b32ee..06eb38357b41 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -293,7 +293,7 @@ static struct bio *blk_bio_segment_split(struct request_queue *q,
 void __blk_queue_split(struct request_queue *q, struct bio **bio,
 		unsigned int *nr_segs)
 {
-	struct bio *split;
+	struct bio *split = NULL;
 
 	switch (bio_op(*bio)) {
 	case REQ_OP_DISCARD:
@@ -309,6 +309,19 @@ void __blk_queue_split(struct request_queue *q, struct bio **bio,
 				nr_segs);
 		break;
 	default:
+		/*
+		 * All drivers must accept single-segment bios that are <=
+		 * PAGE_SIZE.  This is a quick and dirty check that relies on
+		 * the fact that bi_io_vec[0] is always valid if a bio has data.
+		 * The check might lead to occasional false negatives when bios
+		 * are cloned, but compared to the performance impact of cloned
+		 * bios themselves the loop below doesn't matter anyway.
+		 */
+		if ((*bio)->bi_vcnt == 1 &&
+		    (*bio)->bi_io_vec[0].bv_len <= PAGE_SIZE) {
+			*nr_segs = 1;
+			break;
+		}
 		split = blk_bio_segment_split(q, *bio, &q->bio_split, nr_segs);
 		break;
 	}