diff mbox

[3/4] blk-mq: provide internal in-flight variant

Message ID 1501790516-6924-4-git-send-email-axboe@kernel.dk (mailing list archive)
State New, archived
Headers show

Commit Message

Jens Axboe Aug. 3, 2017, 8:01 p.m. UTC
We don't have to inc/dec some counter, since we can just
iterate the tags. That makes inc/dec a noop, but means we
have to iterate busy tags to get an in-flight count.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 block/blk-mq.c        | 24 ++++++++++++++++++++++++
 block/blk-mq.h        |  2 ++
 block/genhd.c         | 29 +++++++++++++++++++++++++++++
 include/linux/genhd.h | 25 +++----------------------
 4 files changed, 58 insertions(+), 22 deletions(-)

Comments

Bart Van Assche Aug. 3, 2017, 8:41 p.m. UTC | #1
On Thu, 2017-08-03 at 14:01 -0600, Jens Axboe wrote:
> We don't have to inc/dec some counter, since we can just
> iterate the tags. That makes inc/dec a noop, but means we
> have to iterate busy tags to get an in-flight count.
> [ ... ]
> +unsigned int blk_mq_in_flight(struct request_queue *q,
> +			       struct hd_struct *part)
> +{
> +	struct mq_inflight mi = { .part = part, .inflight = 0 };

Hello Jens,

A minor stylistic comment: since a C compiler is required to initialize to zero
all member variables that have not been initialized explicitly I think
".inflight = 0" can be left out.

> diff --git a/block/genhd.c b/block/genhd.c
> index f735af67a0c9..ad5dc567d57f 100644
> --- a/block/genhd.c
> +++ b/block/genhd.c
> @@ -45,6 +45,35 @@ static void disk_add_events(struct gendisk *disk);
>  static void disk_del_events(struct gendisk *disk);
>  static void disk_release_events(struct gendisk *disk);
>  
> +void part_inc_in_flight(struct request_queue *q, struct hd_struct *part, int rw)
> +{
> +	if (q->mq_ops)
> +		return;
> +
> +	atomic_inc(&part->in_flight[rw]);
> +	if (part->partno)
> +		atomic_inc(&part_to_disk(part)->part0.in_flight[rw]);
> +}
> [ ... ]
> diff --git a/include/linux/genhd.h b/include/linux/genhd.h
> index 7f7427e00f9c..f2c5096b3a7e 100644
> --- a/include/linux/genhd.h
> +++ b/include/linux/genhd.h
> @@ -362,28 +362,9 @@ static inline void free_part_stats(struct hd_struct *part)
>  #define part_stat_sub(cpu, gendiskp, field, subnd)			\
>  	part_stat_add(cpu, gendiskp, field, -subnd)
>  
> -static inline void part_inc_in_flight(struct request_queue *q,
> -				      struct hd_struct *part, int rw)
> -{
> -	atomic_inc(&part->in_flight[rw]);
> -	if (part->partno)
> -		atomic_inc(&part_to_disk(part)->part0.in_flight[rw]);
> -}
> [ ... ]

Sorry but to me it seems like this part of the patch does not match the patch
description? The patch description mentions that inc and dec become a noop but
it seems to me that these functions have been uninlined instead of making these
a noop?

Bart.
Jens Axboe Aug. 3, 2017, 8:45 p.m. UTC | #2
On 08/03/2017 02:41 PM, Bart Van Assche wrote:
> On Thu, 2017-08-03 at 14:01 -0600, Jens Axboe wrote:
>> We don't have to inc/dec some counter, since we can just
>> iterate the tags. That makes inc/dec a noop, but means we
>> have to iterate busy tags to get an in-flight count.
>> [ ... ]
>> +unsigned int blk_mq_in_flight(struct request_queue *q,
>> +			       struct hd_struct *part)
>> +{
>> +	struct mq_inflight mi = { .part = part, .inflight = 0 };
> 
> Hello Jens,
> 
> A minor stylistic comment: since a C compiler is required to
> initialize to zero all member variables that have not been initialized
> explicitly I think ".inflight = 0" can be left out.

It can, I'll kill it.

>> diff --git a/block/genhd.c b/block/genhd.c
>> index f735af67a0c9..ad5dc567d57f 100644
>> --- a/block/genhd.c
>> +++ b/block/genhd.c
>> @@ -45,6 +45,35 @@ static void disk_add_events(struct gendisk *disk);
>>  static void disk_del_events(struct gendisk *disk);
>>  static void disk_release_events(struct gendisk *disk);
>>  
>> +void part_inc_in_flight(struct request_queue *q, struct hd_struct *part, int rw)
>> +{
>> +	if (q->mq_ops)
>> +		return;
>> +
>> +	atomic_inc(&part->in_flight[rw]);
>> +	if (part->partno)
>> +		atomic_inc(&part_to_disk(part)->part0.in_flight[rw]);
>> +}
>> [ ... ]
>> diff --git a/include/linux/genhd.h b/include/linux/genhd.h
>> index 7f7427e00f9c..f2c5096b3a7e 100644
>> --- a/include/linux/genhd.h
>> +++ b/include/linux/genhd.h
>> @@ -362,28 +362,9 @@ static inline void free_part_stats(struct hd_struct *part)
>>  #define part_stat_sub(cpu, gendiskp, field, subnd)			\
>>  	part_stat_add(cpu, gendiskp, field, -subnd)
>>  
>> -static inline void part_inc_in_flight(struct request_queue *q,
>> -				      struct hd_struct *part, int rw)
>> -{
>> -	atomic_inc(&part->in_flight[rw]);
>> -	if (part->partno)
>> -		atomic_inc(&part_to_disk(part)->part0.in_flight[rw]);
>> -}
>> [ ... ]
> 
> Sorry but to me it seems like this part of the patch does not match
> the patch description? The patch description mentions that inc and dec
> become a noop but it seems to me that these functions have been
> uninlined instead of making these a noop?

The inc/dec goes away for mq, the non-mq path still has to use them. I just
move them as well. Could be a prep patch, but it's just moving the code out
of the header and into a normal C file instead.
Bart Van Assche Aug. 3, 2017, 8:54 p.m. UTC | #3
On Thu, 2017-08-03 at 14:45 -0600, Jens Axboe wrote:
> The inc/dec goes away for mq, the non-mq path still has to use them. I just
> move them as well. Could be a prep patch, but it's just moving the code out
> of the header and into a normal C file instead.

Hello Jens,

I misread that part of the patch. Now I have had another look these changes
look fine to me. Hence:

Reviewed-by: Bart Van Assche <bart.vanassche@wdc.com>
Bart Van Assche Aug. 3, 2017, 9:25 p.m. UTC | #4
On Thu, 2017-08-03 at 14:01 -0600, Jens Axboe wrote:
> +static void blk_mq_check_inflight(struct blk_mq_hw_ctx *hctx,
> +				  struct request *rq, void *priv,
> +				  bool reserved)
> +{
> +	struct mq_inflight *mi = priv;
> +
> +	if (rq->part == mi->part)
> +		mi->inflight++;
> +}
> [ ... ]
> -static inline void part_inc_in_flight(struct request_queue *q,
> -				      struct hd_struct *part, int rw)
> -{
> -	atomic_inc(&part->in_flight[rw]);
> -	if (part->partno)
> -		atomic_inc(&part_to_disk(part)->part0.in_flight[rw]);
> -}

Hello Jens,

The existing part_inc_in_flight() code includes all requests in the in_flight
statistics for part0 but the new code in blk_mq_check_inflight() not. Is that
on purpose? Should the rq->part == mi->part check perhaps be skipped if
mi->part represents part0?

Thanks,

Bart.
Jens Axboe Aug. 3, 2017, 10:36 p.m. UTC | #5
On 08/03/2017 03:25 PM, Bart Van Assche wrote:
> On Thu, 2017-08-03 at 14:01 -0600, Jens Axboe wrote:
>> +static void blk_mq_check_inflight(struct blk_mq_hw_ctx *hctx,
>> +				  struct request *rq, void *priv,
>> +				  bool reserved)
>> +{
>> +	struct mq_inflight *mi = priv;
>> +
>> +	if (rq->part == mi->part)
>> +		mi->inflight++;
>> +}
>> [ ... ]
>> -static inline void part_inc_in_flight(struct request_queue *q,
>> -				      struct hd_struct *part, int rw)
>> -{
>> -	atomic_inc(&part->in_flight[rw]);
>> -	if (part->partno)
>> -		atomic_inc(&part_to_disk(part)->part0.in_flight[rw]);
>> -}
> 
> Hello Jens,
> 
> The existing part_inc_in_flight() code includes all requests in the in_flight
> statistics for part0 but the new code in blk_mq_check_inflight() not. Is that
> on purpose? Should the rq->part == mi->part check perhaps be skipped if
> mi->part represents part0?

The existing code increments always for the partition in question, and
for the root if it's a partition. I'll take a look at that logic, and
ensure it's all correct.
Ming Lei Aug. 4, 2017, 11:17 a.m. UTC | #6
On Thu, Aug 03, 2017 at 02:01:55PM -0600, Jens Axboe wrote:
> We don't have to inc/dec some counter, since we can just
> iterate the tags. That makes inc/dec a noop, but means we
> have to iterate busy tags to get an in-flight count.
> 
> Signed-off-by: Jens Axboe <axboe@kernel.dk>
> ---
>  block/blk-mq.c        | 24 ++++++++++++++++++++++++
>  block/blk-mq.h        |  2 ++
>  block/genhd.c         | 29 +++++++++++++++++++++++++++++
>  include/linux/genhd.h | 25 +++----------------------
>  4 files changed, 58 insertions(+), 22 deletions(-)
> 
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 05dfa3f270ae..37035891e120 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -86,6 +86,30 @@ static void blk_mq_hctx_clear_pending(struct blk_mq_hw_ctx *hctx,
>  	sbitmap_clear_bit(&hctx->ctx_map, ctx->index_hw);
>  }
>  
> +struct mq_inflight {
> +	struct hd_struct *part;
> +	unsigned int inflight;
> +};
> +
> +static void blk_mq_check_inflight(struct blk_mq_hw_ctx *hctx,
> +				  struct request *rq, void *priv,
> +				  bool reserved)
> +{
> +	struct mq_inflight *mi = priv;
> +
> +	if (rq->part == mi->part)
> +		mi->inflight++;
> +}
> +
> +unsigned int blk_mq_in_flight(struct request_queue *q,
> +			       struct hd_struct *part)
> +{
> +	struct mq_inflight mi = { .part = part, .inflight = 0 };
> +
> +	blk_mq_queue_tag_busy_iter(q, blk_mq_check_inflight, &mi);
> +	return mi.inflight;
> +}

IMO it might not be as efficient as per-cpu variable.

For example, NVMe on one 128-core system, if we use percpu variable,
it is enough to read 128 local variable from each CPU for accounting
one in_flight.

But in this way of blk_mq_in_flight(), we need to do 128 
sbitmap search, and one sbitmap search need to read at least
16 words of 'unsigned long',  and total 128*16 read.

So maybe we need to compare the two approaches first.

Thanks,
Ming
Jens Axboe Aug. 4, 2017, 1:55 p.m. UTC | #7
On 08/04/2017 05:17 AM, Ming Lei wrote:
> On Thu, Aug 03, 2017 at 02:01:55PM -0600, Jens Axboe wrote:
>> We don't have to inc/dec some counter, since we can just
>> iterate the tags. That makes inc/dec a noop, but means we
>> have to iterate busy tags to get an in-flight count.
>>
>> Signed-off-by: Jens Axboe <axboe@kernel.dk>
>> ---
>>  block/blk-mq.c        | 24 ++++++++++++++++++++++++
>>  block/blk-mq.h        |  2 ++
>>  block/genhd.c         | 29 +++++++++++++++++++++++++++++
>>  include/linux/genhd.h | 25 +++----------------------
>>  4 files changed, 58 insertions(+), 22 deletions(-)
>>
>> diff --git a/block/blk-mq.c b/block/blk-mq.c
>> index 05dfa3f270ae..37035891e120 100644
>> --- a/block/blk-mq.c
>> +++ b/block/blk-mq.c
>> @@ -86,6 +86,30 @@ static void blk_mq_hctx_clear_pending(struct blk_mq_hw_ctx *hctx,
>>  	sbitmap_clear_bit(&hctx->ctx_map, ctx->index_hw);
>>  }
>>  
>> +struct mq_inflight {
>> +	struct hd_struct *part;
>> +	unsigned int inflight;
>> +};
>> +
>> +static void blk_mq_check_inflight(struct blk_mq_hw_ctx *hctx,
>> +				  struct request *rq, void *priv,
>> +				  bool reserved)
>> +{
>> +	struct mq_inflight *mi = priv;
>> +
>> +	if (rq->part == mi->part)
>> +		mi->inflight++;
>> +}
>> +
>> +unsigned int blk_mq_in_flight(struct request_queue *q,
>> +			       struct hd_struct *part)
>> +{
>> +	struct mq_inflight mi = { .part = part, .inflight = 0 };
>> +
>> +	blk_mq_queue_tag_busy_iter(q, blk_mq_check_inflight, &mi);
>> +	return mi.inflight;
>> +}
> 
> IMO it might not be as efficient as per-cpu variable.
> 
> For example, NVMe on one 128-core system, if we use percpu variable,
> it is enough to read 128 local variable from each CPU for accounting
> one in_flight.

IFF the system is configured with NR_CPUS=128. Most distros go
much bigger. On the other hand, we know that nr_queues will
never be bigger than the number of online cpus, not the number
of possible cpus.

> But in this way of blk_mq_in_flight(), we need to do 128 
> sbitmap search, and one sbitmap search need to read at least
> 16 words of 'unsigned long',  and total 128*16 read.

If that ends up being a problem, it hasn't in testing, then we
could always stuff an index in front of the full sbitmap.

> So maybe we need to compare the two approaches first.

We already did, back when this was originally posted. See the
thread from end may / start june and the results from Brian.
Ming Lei Aug. 4, 2017, 10:19 p.m. UTC | #8
On Fri, Aug 04, 2017 at 07:55:41AM -0600, Jens Axboe wrote:
> On 08/04/2017 05:17 AM, Ming Lei wrote:
> > On Thu, Aug 03, 2017 at 02:01:55PM -0600, Jens Axboe wrote:
> >> We don't have to inc/dec some counter, since we can just
> >> iterate the tags. That makes inc/dec a noop, but means we
> >> have to iterate busy tags to get an in-flight count.
> >>
> >> Signed-off-by: Jens Axboe <axboe@kernel.dk>
> >> ---
> >>  block/blk-mq.c        | 24 ++++++++++++++++++++++++
> >>  block/blk-mq.h        |  2 ++
> >>  block/genhd.c         | 29 +++++++++++++++++++++++++++++
> >>  include/linux/genhd.h | 25 +++----------------------
> >>  4 files changed, 58 insertions(+), 22 deletions(-)
> >>
> >> diff --git a/block/blk-mq.c b/block/blk-mq.c
> >> index 05dfa3f270ae..37035891e120 100644
> >> --- a/block/blk-mq.c
> >> +++ b/block/blk-mq.c
> >> @@ -86,6 +86,30 @@ static void blk_mq_hctx_clear_pending(struct blk_mq_hw_ctx *hctx,
> >>  	sbitmap_clear_bit(&hctx->ctx_map, ctx->index_hw);
> >>  }
> >>  
> >> +struct mq_inflight {
> >> +	struct hd_struct *part;
> >> +	unsigned int inflight;
> >> +};
> >> +
> >> +static void blk_mq_check_inflight(struct blk_mq_hw_ctx *hctx,
> >> +				  struct request *rq, void *priv,
> >> +				  bool reserved)
> >> +{
> >> +	struct mq_inflight *mi = priv;
> >> +
> >> +	if (rq->part == mi->part)
> >> +		mi->inflight++;
> >> +}
> >> +
> >> +unsigned int blk_mq_in_flight(struct request_queue *q,
> >> +			       struct hd_struct *part)
> >> +{
> >> +	struct mq_inflight mi = { .part = part, .inflight = 0 };
> >> +
> >> +	blk_mq_queue_tag_busy_iter(q, blk_mq_check_inflight, &mi);
> >> +	return mi.inflight;
> >> +}
> > 
> > IMO it might not be as efficient as per-cpu variable.
> > 
> > For example, NVMe on one 128-core system, if we use percpu variable,
> > it is enough to read 128 local variable from each CPU for accounting
> > one in_flight.
> 
> IFF the system is configured with NR_CPUS=128. Most distros go
> much bigger. On the other hand, we know that nr_queues will
> never be bigger than the number of online cpus, not the number
> of possible cpus.

We usually use for_each_possible_cpu() for aggregating CPU
local counters, and num_possible_cpus() is the number of
CPUs populatable in system, which is much less than NR_CPUs:

include/linux/cpumask.h:
*     cpu_possible_mask- has bit 'cpu' set iff cpu is populatable

> 
> > But in this way of blk_mq_in_flight(), we need to do 128 
> > sbitmap search, and one sbitmap search need to read at least
> > 16 words of 'unsigned long',  and total 128*16 read.
> 
> If that ends up being a problem, it hasn't in testing, then we
> could always stuff an index in front of the full sbitmap.
> 
> > So maybe we need to compare the two approaches first.
> 
> We already did, back when this was originally posted. See the
> thread from end may / start june and the results from Brian.

Can't find the comparison data between percpu accounting vs. mq-inflight
in that thread.

Just saw Brian mentioned in patch log that percpu may reach
11.4M(I guess 'M' is missed) [1]:

	"When running this on a Power system, to a single null_blk device
	with 80 submission queues, irq mode 0, with 80 fio jobs, I saw IOPs
	go from 1.5M IO/s to 11.4 IO/s."

But in link[2], he said mq-inflight can reach 9.4M.

Could Brian explain it a bit? Maybe the two tests were run in
different settings, don't know.

Even though mq-inflight is better, I guess we need to understand
the principle behind why it is better than percpu...


[1] http://marc.info/?l=linux-block&m=149868436905520&w=2
[2] http://marc.info/?l=linux-block&m=149920174301644&w=2

Thanks,
Ming
Brian King Aug. 7, 2017, 7:54 p.m. UTC | #9
On 08/04/2017 05:19 PM, Ming Lei wrote:
> On Fri, Aug 04, 2017 at 07:55:41AM -0600, Jens Axboe wrote:
>> On 08/04/2017 05:17 AM, Ming Lei wrote:
>>> On Thu, Aug 03, 2017 at 02:01:55PM -0600, Jens Axboe wrote:
>>>> We don't have to inc/dec some counter, since we can just
>>>> iterate the tags. That makes inc/dec a noop, but means we
>>>> have to iterate busy tags to get an in-flight count.
>>>>
>>>> Signed-off-by: Jens Axboe <axboe@kernel.dk>
>>>> ---
>>>>  block/blk-mq.c        | 24 ++++++++++++++++++++++++
>>>>  block/blk-mq.h        |  2 ++
>>>>  block/genhd.c         | 29 +++++++++++++++++++++++++++++
>>>>  include/linux/genhd.h | 25 +++----------------------
>>>>  4 files changed, 58 insertions(+), 22 deletions(-)
>>>>
>>>> diff --git a/block/blk-mq.c b/block/blk-mq.c
>>>> index 05dfa3f270ae..37035891e120 100644
>>>> --- a/block/blk-mq.c
>>>> +++ b/block/blk-mq.c
>>>> @@ -86,6 +86,30 @@ static void blk_mq_hctx_clear_pending(struct blk_mq_hw_ctx *hctx,
>>>>  	sbitmap_clear_bit(&hctx->ctx_map, ctx->index_hw);
>>>>  }
>>>>  
>>>> +struct mq_inflight {
>>>> +	struct hd_struct *part;
>>>> +	unsigned int inflight;
>>>> +};
>>>> +
>>>> +static void blk_mq_check_inflight(struct blk_mq_hw_ctx *hctx,
>>>> +				  struct request *rq, void *priv,
>>>> +				  bool reserved)
>>>> +{
>>>> +	struct mq_inflight *mi = priv;
>>>> +
>>>> +	if (rq->part == mi->part)
>>>> +		mi->inflight++;
>>>> +}
>>>> +
>>>> +unsigned int blk_mq_in_flight(struct request_queue *q,
>>>> +			       struct hd_struct *part)
>>>> +{
>>>> +	struct mq_inflight mi = { .part = part, .inflight = 0 };
>>>> +
>>>> +	blk_mq_queue_tag_busy_iter(q, blk_mq_check_inflight, &mi);
>>>> +	return mi.inflight;
>>>> +}
>>>
>>> IMO it might not be as efficient as per-cpu variable.
>>>
>>> For example, NVMe on one 128-core system, if we use percpu variable,
>>> it is enough to read 128 local variable from each CPU for accounting
>>> one in_flight.
>>
>> IFF the system is configured with NR_CPUS=128. Most distros go
>> much bigger. On the other hand, we know that nr_queues will
>> never be bigger than the number of online cpus, not the number
>> of possible cpus.
> 
> We usually use for_each_possible_cpu() for aggregating CPU
> local counters, and num_possible_cpus() is the number of
> CPUs populatable in system, which is much less than NR_CPUs:
> 
> include/linux/cpumask.h:
> *     cpu_possible_mask- has bit 'cpu' set iff cpu is populatable
> 
>>
>>> But in this way of blk_mq_in_flight(), we need to do 128 
>>> sbitmap search, and one sbitmap search need to read at least
>>> 16 words of 'unsigned long',  and total 128*16 read.
>>
>> If that ends up being a problem, it hasn't in testing, then we
>> could always stuff an index in front of the full sbitmap.
>>
>>> So maybe we need to compare the two approaches first.
>>
>> We already did, back when this was originally posted. See the
>> thread from end may / start june and the results from Brian.
> 
> Can't find the comparison data between percpu accounting vs. mq-inflight
> in that thread.
> 
> Just saw Brian mentioned in patch log that percpu may reach
> 11.4M(I guess 'M' is missed) [1]:
> 
> 	"When running this on a Power system, to a single null_blk device
> 	with 80 submission queues, irq mode 0, with 80 fio jobs, I saw IOPs
> 	go from 1.5M IO/s to 11.4 IO/s."
> 
> But in link[2], he said mq-flight can reach 9.4M.
> 
> Could Brian explain it a bit? Maybe the two tests were run in
> different settings, don't know.

The 11.4M IOPs vs 9.4M IOPs runs cannot be directly compared, as they were
run on different systems with different NVMe devices.

The key comparison I kept coming back to in my measurements was:

N jobs run to 1 null_blk vs. N null_blk devices, each with 1 job

If the IOPs I get between the two is similar, that should show that
we don't have scaling issues in blk-mq. 

There were three variations of patches I tried with:

* per-cpu - Patch from me

* per-node-atomic - Patch from Ming

* mq-inflight - Patch from Jens

All of them provided a massive improvement in my environment. The mq-inflight
was only marginally better than the per node and it was most prominent in the
N null_blk each with 1 job. While the per-node atomic approach certainly
reduced cross node contention on the atomics, they are still atomics, which
have a bit of overhead, particularly on the Power platform. 

As for the difference between the percpu approach and the mq-inflight approach,
I didn't compare them directly in the same config, since I didn't think the
percpu approach would go anywhere after the initial discussion we had on the list.

I'll get some time on the test machine again and do a direct comparison between
in-flight and percpu to see if there are any significant difference.

Thanks,

Brian
diff mbox

Patch

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 05dfa3f270ae..37035891e120 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -86,6 +86,30 @@  static void blk_mq_hctx_clear_pending(struct blk_mq_hw_ctx *hctx,
 	sbitmap_clear_bit(&hctx->ctx_map, ctx->index_hw);
 }
 
+struct mq_inflight {
+	struct hd_struct *part;
+	unsigned int inflight;
+};
+
+static void blk_mq_check_inflight(struct blk_mq_hw_ctx *hctx,
+				  struct request *rq, void *priv,
+				  bool reserved)
+{
+	struct mq_inflight *mi = priv;
+
+	if (rq->part == mi->part)
+		mi->inflight++;
+}
+
+unsigned int blk_mq_in_flight(struct request_queue *q,
+			       struct hd_struct *part)
+{
+	struct mq_inflight mi = { .part = part, .inflight = 0 };
+
+	blk_mq_queue_tag_busy_iter(q, blk_mq_check_inflight, &mi);
+	return mi.inflight;
+}
+
 void blk_freeze_queue_start(struct request_queue *q)
 {
 	int freeze_depth;
diff --git a/block/blk-mq.h b/block/blk-mq.h
index 1a06fdf9fd4d..cade1a512a01 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -138,4 +138,6 @@  static inline bool blk_mq_hw_queue_mapped(struct blk_mq_hw_ctx *hctx)
 	return hctx->nr_ctx && hctx->tags;
 }
 
+unsigned int blk_mq_in_flight(struct request_queue *q, struct hd_struct *part);
+
 #endif
diff --git a/block/genhd.c b/block/genhd.c
index f735af67a0c9..ad5dc567d57f 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -45,6 +45,35 @@  static void disk_add_events(struct gendisk *disk);
 static void disk_del_events(struct gendisk *disk);
 static void disk_release_events(struct gendisk *disk);
 
+void part_inc_in_flight(struct request_queue *q, struct hd_struct *part, int rw)
+{
+	if (q->mq_ops)
+		return;
+
+	atomic_inc(&part->in_flight[rw]);
+	if (part->partno)
+		atomic_inc(&part_to_disk(part)->part0.in_flight[rw]);
+}
+
+void part_dec_in_flight(struct request_queue *q, struct hd_struct *part, int rw)
+{
+	if (q->mq_ops)
+		return;
+
+	atomic_dec(&part->in_flight[rw]);
+	if (part->partno)
+		atomic_dec(&part_to_disk(part)->part0.in_flight[rw]);
+}
+
+int part_in_flight(struct request_queue *q, struct hd_struct *part)
+{
+	if (q->mq_ops)
+		return blk_mq_in_flight(q, part);
+
+	return atomic_read(&part->in_flight[0]) +
+			atomic_read(&part->in_flight[1]);
+}
+
 /**
  * disk_get_part - get partition
  * @disk: disk to look partition from
diff --git a/include/linux/genhd.h b/include/linux/genhd.h
index 7f7427e00f9c..f2c5096b3a7e 100644
--- a/include/linux/genhd.h
+++ b/include/linux/genhd.h
@@ -362,28 +362,9 @@  static inline void free_part_stats(struct hd_struct *part)
 #define part_stat_sub(cpu, gendiskp, field, subnd)			\
 	part_stat_add(cpu, gendiskp, field, -subnd)
 
-static inline void part_inc_in_flight(struct request_queue *q,
-				      struct hd_struct *part, int rw)
-{
-	atomic_inc(&part->in_flight[rw]);
-	if (part->partno)
-		atomic_inc(&part_to_disk(part)->part0.in_flight[rw]);
-}
-
-static inline void part_dec_in_flight(struct request_queue *q,
-				      struct hd_struct *part, int rw)
-{
-	atomic_dec(&part->in_flight[rw]);
-	if (part->partno)
-		atomic_dec(&part_to_disk(part)->part0.in_flight[rw]);
-}
-
-static inline int part_in_flight(struct request_queue *q,
-				 struct hd_struct *part)
-{
-	return atomic_read(&part->in_flight[0]) +
-			atomic_read(&part->in_flight[1]);
-}
+int part_in_flight(struct request_queue *q, struct hd_struct *part);
+void part_dec_in_flight(struct request_queue *q, struct hd_struct *part, int rw);
+void part_inc_in_flight(struct request_queue *q, struct hd_struct *part, int rw);
 
 static inline struct partition_meta_info *alloc_part_info(struct gendisk *disk)
 {