[v2,4/6] block: switch to per-cpu in-flight counters

Message ID 20181130222226.77216-5-snitzer@redhat.com
State New
Series
  • per-cpu in_flight counters for bio-based drivers

Commit Message

Mike Snitzer Nov. 30, 2018, 10:22 p.m. UTC
From: Mikulas Patocka <mpatocka@redhat.com>

Now that part_round_stats is gone, we can switch to per-cpu in-flight
counters.

We use the local-atomic type local_t, so that if part_inc_in_flight or
part_dec_in_flight is reentrantly called from an interrupt, the value will
be correct.

The other counters could be corrupted due to reentrant interrupt, but the
corruption only results in slight counter skew - the in_flight counter
must be exact, so it needs local_t.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
---
 block/bio.c           |  4 ++--
 block/blk-core.c      |  4 ++--
 block/blk-merge.c     |  2 +-
 block/genhd.c         | 47 +++++++++++++++++++++++++++++++++++------------
 include/linux/genhd.h |  7 ++++---
 5 files changed, 44 insertions(+), 20 deletions(-)

Comments

Jens Axboe Dec. 5, 2018, 5:30 p.m. UTC | #1
On 11/30/18 3:22 PM, Mike Snitzer wrote:
> diff --git a/block/genhd.c b/block/genhd.c
> index cdf174d7d329..d4c9dd65def6 100644
> --- a/block/genhd.c
> +++ b/block/genhd.c
> @@ -45,53 +45,76 @@ static void disk_add_events(struct gendisk *disk);
>  static void disk_del_events(struct gendisk *disk);
>  static void disk_release_events(struct gendisk *disk);
>  
> -void part_inc_in_flight(struct request_queue *q, struct hd_struct *part, int rw)
> +void part_inc_in_flight(struct request_queue *q, int cpu, struct hd_struct *part, int rw)
>  {
>  	if (queue_is_mq(q))
>  		return;
>  
> -	atomic_inc(&part->in_flight[rw]);
> +	local_inc(&per_cpu_ptr(part->dkstats, cpu)->in_flight[rw]);

I mentioned this in a previous email, but why isn't this just using
this_cpu_inc? There's also no need to pass in the cpu, if we're not
running with preempt disabled already we have a problem.
Mike Snitzer Dec. 5, 2018, 5:49 p.m. UTC | #2
On Wed, Dec 05 2018 at 12:30pm -0500,
Jens Axboe <axboe@kernel.dk> wrote:

> On 11/30/18 3:22 PM, Mike Snitzer wrote:
> > diff --git a/block/genhd.c b/block/genhd.c
> > index cdf174d7d329..d4c9dd65def6 100644
> > --- a/block/genhd.c
> > +++ b/block/genhd.c
> > @@ -45,53 +45,76 @@ static void disk_add_events(struct gendisk *disk);
> >  static void disk_del_events(struct gendisk *disk);
> >  static void disk_release_events(struct gendisk *disk);
> >  
> > -void part_inc_in_flight(struct request_queue *q, struct hd_struct *part, int rw)
> > +void part_inc_in_flight(struct request_queue *q, int cpu, struct hd_struct *part, int rw)
> >  {
> >  	if (queue_is_mq(q))
> >  		return;
> >  
> > -	atomic_inc(&part->in_flight[rw]);
> > +	local_inc(&per_cpu_ptr(part->dkstats, cpu)->in_flight[rw]);
> 
> I mentioned this in a previous email, but why isn't this just using
> this_cpu_inc?

I responded to your earlier question on this point, but Mikulas just
extended the existing percpu struct disk_stats and he is using local_t
for reasons detailed in this patch's header:

    We use the local-atomic type local_t, so that if part_inc_in_flight or
    part_dec_in_flight is reentrantly called from an interrupt, the value will
    be correct.

    The other counters could be corrupted due to reentrant interrupt, but the
    corruption only results in slight counter skew - the in_flight counter
    must be exact, so it needs local_t.

> There's also no need to pass in the cpu, if we're not running with
> preempt disabled already we have a problem. 

Why should this be any different than the part_stat_* interfaces?
__part_stat_add(), part_stat_read(), etc also use
per_cpu_ptr((part)->dkstats, (cpu)) accessors.
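
To make the reentrancy argument in the patch header concrete, here is a minimal userspace sketch. It is not kernel code, and the function names are hypothetical. It simulates an interrupt arriving in the middle of a plain load/add/store increment, then compares that with an increment that is a single step with respect to interrupts on the same CPU, which is roughly the guarantee local_t is meant to provide.

```c
/*
 * Hypothetical userspace model of the lost-update problem; the names
 * below are illustrative and do not exist in the kernel.
 */

/*
 * A plain counter updated as separate load / add / store steps.
 * The "interrupt" is simulated by a nested increment placed between
 * the outer load and the outer store.
 */
long plain_inc_interrupted(long *counter)
{
	long tmp = *counter;		/* outer context loads the old value */

	*counter = *counter + 1;	/* simulated interrupt: nested increment */

	*counter = tmp + 1;		/* outer store overwrites the nested update */
	return *counter;
}

/*
 * Model of an increment that cannot be split by an interrupt on the
 * same CPU: the nested increment can only run before or after it.
 */
long indivisible_inc_interrupted(long *counter)
{
	*counter += 1;			/* outer increment completes as one step */
	*counter += 1;			/* simulated interrupt runs afterwards */
	return *counter;
}
```

Starting from zero, the first function leaves the counter at 1 after two logical increments, while the second leaves it at 2. For a statistic such as ios[] a lost update is only skew, but for in_flight it would never be corrected, which appears to be why the header insists on local_t.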
Jens Axboe Dec. 5, 2018, 5:54 p.m. UTC | #3
On 12/5/18 10:49 AM, Mike Snitzer wrote:
> On Wed, Dec 05 2018 at 12:30pm -0500,
> Jens Axboe <axboe@kernel.dk> wrote:
> 
>> On 11/30/18 3:22 PM, Mike Snitzer wrote:
>>> diff --git a/block/genhd.c b/block/genhd.c
>>> index cdf174d7d329..d4c9dd65def6 100644
>>> --- a/block/genhd.c
>>> +++ b/block/genhd.c
>>> @@ -45,53 +45,76 @@ static void disk_add_events(struct gendisk *disk);
>>>  static void disk_del_events(struct gendisk *disk);
>>>  static void disk_release_events(struct gendisk *disk);
>>>  
>>> -void part_inc_in_flight(struct request_queue *q, struct hd_struct *part, int rw)
>>> +void part_inc_in_flight(struct request_queue *q, int cpu, struct hd_struct *part, int rw)
>>>  {
>>>  	if (queue_is_mq(q))
>>>  		return;
>>>  
>>> -	atomic_inc(&part->in_flight[rw]);
>>> +	local_inc(&per_cpu_ptr(part->dkstats, cpu)->in_flight[rw]);
>>
>> I mentioned this in a previous email, but why isn't this just using
>> this_cpu_inc?
> 
> I responded to your earlier question on this point, but Mikulas just
> extended the existing percpu struct disk_stats and he is using local_t
> for reasons detailed in this patch's header:
> 
>     We use the local-atomic type local_t, so that if part_inc_in_flight or
>     part_dec_in_flight is reentrantly called from an interrupt, the value will
>     be correct.
> 
>     The other counters could be corrupted due to reentrant interrupt, but the
>     corruption only results in slight counter skew - the in_flight counter
>     must be exact, so it needs local_t.

Gotcha, makes sense.

>> There's also no need to pass in the cpu, if we're not running with
>> preempt disabled already we have a problem. 
> 
> Why should this be any different than the part_stat_* interfaces?
> __part_stat_add(), part_stat_read(), etc also use
> per_cpu_ptr((part)->dkstats, (cpu)) accessors.

Maybe audit which ones actually need it? To answer the specific question,
it's silly to pass in the cpu, if we're pinned already. That's true
both programmatically and for someone reading the code.
Mike Snitzer Dec. 5, 2018, 6:03 p.m. UTC | #4
On Wed, Dec 05 2018 at 12:54pm -0500,
Jens Axboe <axboe@kernel.dk> wrote:

> On 12/5/18 10:49 AM, Mike Snitzer wrote:
> > On Wed, Dec 05 2018 at 12:30pm -0500,
> > Jens Axboe <axboe@kernel.dk> wrote:
> > 
> >> There's also no need to pass in the cpu, if we're not running with
> >> preempt disabled already we have a problem. 
> > 
> > Why should this be any different than the part_stat_* interfaces?
> > __part_stat_add(), part_stat_read(), etc also use
> > per_cpu_ptr((part)->dkstats, (cpu)) accessors.
> 
> Maybe audit which ones actually need it? To answer the specific question,
> it's silly to pass in the cpu, if we're pinned already. That's true
> both programmatically and for someone reading the code.

I understand you'd like to avoid excess interface baggage.  But seems to
me we'd be better off being consistent, when extending the percpu
portion of block core stats, and then do an incremental to clean it all
up.

But I'm open to doing it however you'd like if you feel strongly about
how this should be done.

Mike
Jens Axboe Dec. 5, 2018, 6:04 p.m. UTC | #5
On 12/5/18 11:03 AM, Mike Snitzer wrote:
> On Wed, Dec 05 2018 at 12:54pm -0500,
> Jens Axboe <axboe@kernel.dk> wrote:
> 
>> On 12/5/18 10:49 AM, Mike Snitzer wrote:
>>> On Wed, Dec 05 2018 at 12:30pm -0500,
>>> Jens Axboe <axboe@kernel.dk> wrote:
>>>
>>>> There's also no need to pass in the cpu, if we're not running with
>>>> preempt disabled already we have a problem. 
>>>
>>> Why should this be any different than the part_stat_* interfaces?
>>> __part_stat_add(), part_stat_read(), etc also use
>>> per_cpu_ptr((part)->dkstats, (cpu)) accessors.
>>
>> Maybe audit which ones actually need it? To answer the specific question,
>> it's silly to pass in the cpu, if we're pinned already. That's true
>> both programmatically and for someone reading the code.
> 
> I understand you'd like to avoid excess interface baggage.  But seems to
> me we'd be better off being consistent, when extending the percpu
> portion of block core stats, and then do an incremental to clean it all
> up.

The incremental should be done first in that case, it'd be silly to
introduce something only to do a cleanup right after.
Mike Snitzer Dec. 5, 2018, 6:18 p.m. UTC | #6
On Wed, Dec 05 2018 at  1:04pm -0500,
Jens Axboe <axboe@kernel.dk> wrote:

> On 12/5/18 11:03 AM, Mike Snitzer wrote:
> > On Wed, Dec 05 2018 at 12:54pm -0500,
> > Jens Axboe <axboe@kernel.dk> wrote:
> > 
> >> On 12/5/18 10:49 AM, Mike Snitzer wrote:
> >>> On Wed, Dec 05 2018 at 12:30pm -0500,
> >>> Jens Axboe <axboe@kernel.dk> wrote:
> >>>
> >>>> There's also no need to pass in the cpu, if we're not running with
> >>>> preempt disabled already we have a problem. 
> >>>
> >>> Why should this be any different than the part_stat_* interfaces?
> >>> __part_stat_add(), part_stat_read(), etc also use
> >>> per_cpu_ptr((part)->dkstats, (cpu)) accessors.
> >>
> >> Maybe audit which ones actually need it? To answer the specific question,
> >> it's silly to pass in the cpu, if we're pinned already. That's true
> >> both programmatically and for someone reading the code.
> > 
> > I understand you'd like to avoid excess interface baggage.  But seems to
> > me we'd be better off being consistent, when extending the percpu
> > portion of block core stats, and then do an incremental to clean it all
> > up.
> 
> The incremental should be done first in that case, it'd be silly to
> introduce something only to do a cleanup right after.

OK, all existing code for these percpu stats should follow the pattern:

  int cpu = part_stat_lock();

  <do percpu diskstats stuff>

  part_stat_unlock();

part_stat_lock() calls get_cpu() which does preempt_disable().  So to
your point: yes we have preempt disabled.  And yes we _could_ just use
smp_processor_id() in callers rather than pass 'cpu' to them.

Is that what you want to see?

Mike
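
The two interface shapes being debated can be sketched in userspace as follows. This is only a model under stated assumptions: model_stat_lock() stands in for part_stat_lock(), a simple counter stands in for the preempt count, and every name here is hypothetical rather than an existing kernel symbol.

```c
/*
 * Hypothetical userspace model of the "lock, touch per-cpu stats,
 * unlock" pattern. None of these names exist in the kernel.
 */
#define MODEL_NR_CPUS 4

static int model_current_cpu;	/* CPU the caller is "running" on */
static int model_preempt_count;	/* > 0 means the caller cannot migrate */
long model_in_flight[MODEL_NR_CPUS];

/* Stand-in for part_stat_lock(): pin the caller and report its CPU. */
int model_stat_lock(void)
{
	model_preempt_count++;
	return model_current_cpu;
}

/* Stand-in for part_stat_unlock(): allow migration again. */
void model_stat_unlock(void)
{
	model_preempt_count--;
}

/* Shape used by the patch: the caller passes the cpu explicitly. */
void model_inc_explicit(int cpu)
{
	model_in_flight[cpu]++;
}

/*
 * Shape suggested in review: derive the cpu internally. This is only
 * valid while the caller is pinned, so an unpinned call is rejected.
 */
int model_inc_implicit(void)
{
	if (model_preempt_count <= 0)
		return -1;	/* caller bug: not pinned to a CPU */
	model_in_flight[model_current_cpu]++;
	return 0;
}
```

In this model both shapes update the same slot when called between model_stat_lock() and model_stat_unlock(), so the explicit cpu argument carries no extra information once the caller is pinned.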
Jens Axboe Dec. 5, 2018, 6:35 p.m. UTC | #7
On 12/5/18 11:18 AM, Mike Snitzer wrote:
> On Wed, Dec 05 2018 at  1:04pm -0500,
> Jens Axboe <axboe@kernel.dk> wrote:
> 
>> On 12/5/18 11:03 AM, Mike Snitzer wrote:
>>> On Wed, Dec 05 2018 at 12:54pm -0500,
>>> Jens Axboe <axboe@kernel.dk> wrote:
>>>
>>>> On 12/5/18 10:49 AM, Mike Snitzer wrote:
>>>>> On Wed, Dec 05 2018 at 12:30pm -0500,
>>>>> Jens Axboe <axboe@kernel.dk> wrote:
>>>>>
>>>>>> There's also no need to pass in the cpu, if we're not running with
>>>>>> preempt disabled already we have a problem. 
>>>>>
>>>>> Why should this be any different than the part_stat_* interfaces?
>>>>> __part_stat_add(), part_stat_read(), etc also use
>>>>> per_cpu_ptr((part)->dkstats, (cpu)) accessors.
>>>>
>>>> Maybe audit which ones actually need it? To answer the specific question,
>>>> it's silly to pass in the cpu, if we're pinned already. That's true
>>>> both programmatically and for someone reading the code.
>>>
>>> I understand you'd like to avoid excess interface baggage.  But seems to
>>> me we'd be better off being consistent, when extending the percpu
>>> portion of block core stats, and then do an incremental to clean it all
>>> up.
>>
>> The incremental should be done first in that case, it'd be silly to
>> introduce something only to do a cleanup right after.
> 
> OK, all existing code for these percpu stats should follow the pattern:
> 
>   int cpu = part_stat_lock();
> 
>   <do percpu diskstats stuff>
> 
>   part_stat_unlock();
> 
> part_stat_lock() calls get_cpu() which does preempt_disable().  So to
> your point: yes we have preempt disabled.  And yes we _could_ just use
> smp_processor_id() in callers rather than pass 'cpu' to them.
> 
> Is that what you want to see?

Something like that, yes.
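
For readers following the read side of the patch, the summation in part_in_flight() can be modelled in userspace as below. The sketch assumes a fixed-size array in place of the real percpu allocation, and the struct and function names are hypothetical. A request may be submitted on one CPU and completed on another, so an individual slot can legitimately be negative. Only the sum across all CPUs is meaningful, and it is clamped at zero in case a racy snapshot comes out negative.

```c
/*
 * Hypothetical userspace model of summing per-cpu in-flight counters.
 * The struct and function names are illustrative only.
 */
struct model_disk_stats {
	long in_flight[2];	/* [0] = reads, [1] = writes */
};

/*
 * Sum reads and writes over every CPU slot. Individual slots may be
 * negative; a negative total is treated as zero.
 */
unsigned int model_part_in_flight(const struct model_disk_stats *stats,
				  int nr_cpus)
{
	long sum = 0;
	int cpu;

	for (cpu = 0; cpu < nr_cpus; cpu++)
		sum += stats[cpu].in_flight[0] + stats[cpu].in_flight[1];

	return sum < 0 ? 0 : (unsigned int)sum;
}
```

The cost of this design is that a read walks every possible CPU, while increments and decrements touch only the local slot, which is the trade the patch makes for the non-mq path.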

Patch

diff --git a/block/bio.c b/block/bio.c
index d5ef043a97aa..b25b4fef9900 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -1688,7 +1688,7 @@  void generic_start_io_acct(struct request_queue *q, int op,
 	update_io_ticks(cpu, part, jiffies);
 	part_stat_inc(cpu, part, ios[sgrp]);
 	part_stat_add(cpu, part, sectors[sgrp], sectors);
-	part_inc_in_flight(q, part, op_is_write(op));
+	part_inc_in_flight(q, cpu, part, op_is_write(op));
 
 	part_stat_unlock();
 }
@@ -1705,7 +1705,7 @@  void generic_end_io_acct(struct request_queue *q, int req_op,
 	update_io_ticks(cpu, part, now);
 	part_stat_add(cpu, part, nsecs[sgrp], jiffies_to_nsecs(duration));
 	part_stat_add(cpu, part, time_in_queue, duration);
-	part_dec_in_flight(q, part, op_is_write(req_op));
+	part_dec_in_flight(q, cpu, part, op_is_write(req_op));
 
 	part_stat_unlock();
 }
diff --git a/block/blk-core.c b/block/blk-core.c
index 6bd4669f05fd..87f06672d9a7 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1355,7 +1355,7 @@  void blk_account_io_done(struct request *req, u64 now)
 		part_stat_inc(cpu, part, ios[sgrp]);
 		part_stat_add(cpu, part, nsecs[sgrp], now - req->start_time_ns);
 		part_stat_add(cpu, part, time_in_queue, nsecs_to_jiffies64(now - req->start_time_ns));
-		part_dec_in_flight(req->q, part, rq_data_dir(req));
+		part_dec_in_flight(req->q, cpu, part, rq_data_dir(req));
 
 		hd_struct_put(part);
 		part_stat_unlock();
@@ -1390,7 +1390,7 @@  void blk_account_io_start(struct request *rq, bool new_io)
 			part = &rq->rq_disk->part0;
 			hd_struct_get(part);
 		}
-		part_inc_in_flight(rq->q, part, rw);
+		part_inc_in_flight(rq->q, cpu, part, rw);
 		rq->part = part;
 	}
 
diff --git a/block/blk-merge.c b/block/blk-merge.c
index c278b6d18a24..c02386cdf0ca 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -690,7 +690,7 @@  static void blk_account_io_merge(struct request *req)
 		cpu = part_stat_lock();
 		part = req->part;
 
-		part_dec_in_flight(req->q, part, rq_data_dir(req));
+		part_dec_in_flight(req->q, cpu, part, rq_data_dir(req));
 
 		hd_struct_put(part);
 		part_stat_unlock();
diff --git a/block/genhd.c b/block/genhd.c
index cdf174d7d329..d4c9dd65def6 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -45,53 +45,76 @@  static void disk_add_events(struct gendisk *disk);
 static void disk_del_events(struct gendisk *disk);
 static void disk_release_events(struct gendisk *disk);
 
-void part_inc_in_flight(struct request_queue *q, struct hd_struct *part, int rw)
+void part_inc_in_flight(struct request_queue *q, int cpu, struct hd_struct *part, int rw)
 {
 	if (queue_is_mq(q))
 		return;
 
-	atomic_inc(&part->in_flight[rw]);
+	local_inc(&per_cpu_ptr(part->dkstats, cpu)->in_flight[rw]);
 	if (part->partno)
-		atomic_inc(&part_to_disk(part)->part0.in_flight[rw]);
+		local_inc(&per_cpu_ptr(part_to_disk(part)->part0.dkstats, cpu)->in_flight[rw]);
 }
 
-void part_dec_in_flight(struct request_queue *q, struct hd_struct *part, int rw)
+void part_dec_in_flight(struct request_queue *q, int cpu, struct hd_struct *part, int rw)
 {
 	if (queue_is_mq(q))
 		return;
 
-	atomic_dec(&part->in_flight[rw]);
+	local_dec(&per_cpu_ptr(part->dkstats, cpu)->in_flight[rw]);
 	if (part->partno)
-		atomic_dec(&part_to_disk(part)->part0.in_flight[rw]);
+		local_dec(&per_cpu_ptr(part_to_disk(part)->part0.dkstats, cpu)->in_flight[rw]);
 }
 
 void part_in_flight(struct request_queue *q, struct hd_struct *part,
 		    unsigned int inflight[2])
 {
+	int cpu;
+
 	if (queue_is_mq(q)) {
 		blk_mq_in_flight(q, part, inflight);
 		return;
 	}
 
-	inflight[0] = atomic_read(&part->in_flight[0]) +
-			atomic_read(&part->in_flight[1]);
+	inflight[0] = 0;
+	for_each_possible_cpu(cpu) {
+		inflight[0] +=	local_read(&per_cpu_ptr(part->dkstats, cpu)->in_flight[0]) +
+				local_read(&per_cpu_ptr(part->dkstats, cpu)->in_flight[1]);
+	}
+	if ((int)inflight[0] < 0)
+		inflight[0] = 0;
+
 	if (part->partno) {
 		part = &part_to_disk(part)->part0;
-		inflight[1] = atomic_read(&part->in_flight[0]) +
-				atomic_read(&part->in_flight[1]);
+		inflight[1] = 0;
+		for_each_possible_cpu(cpu) {
+			inflight[1] +=	local_read(&per_cpu_ptr(part->dkstats, cpu)->in_flight[0]) +
+					local_read(&per_cpu_ptr(part->dkstats, cpu)->in_flight[1]);
+		}
+		if ((int)inflight[1] < 0)
+			inflight[1] = 0;
 	}
 }
 
 void part_in_flight_rw(struct request_queue *q, struct hd_struct *part,
 		       unsigned int inflight[2])
 {
+	int cpu;
+
 	if (queue_is_mq(q)) {
 		blk_mq_in_flight_rw(q, part, inflight);
 		return;
 	}
 
-	inflight[0] = atomic_read(&part->in_flight[0]);
-	inflight[1] = atomic_read(&part->in_flight[1]);
+	inflight[0] = 0;
+	inflight[1] = 0;
+	for_each_possible_cpu(cpu) {
+		inflight[0] += local_read(&per_cpu_ptr(part->dkstats, cpu)->in_flight[0]);
+		inflight[1] += local_read(&per_cpu_ptr(part->dkstats, cpu)->in_flight[1]);
+	}
+	if ((int)inflight[0] < 0)
+		inflight[0] = 0;
+	if ((int)inflight[1] < 0)
+		inflight[1] = 0;
 }
 
 struct hd_struct *__disk_get_part(struct gendisk *disk, int partno)
diff --git a/include/linux/genhd.h b/include/linux/genhd.h
index f2a0a52c874f..a03aa6502a83 100644
--- a/include/linux/genhd.h
+++ b/include/linux/genhd.h
@@ -17,6 +17,7 @@ 
 #include <linux/percpu-refcount.h>
 #include <linux/uuid.h>
 #include <linux/blk_types.h>
+#include <asm/local.h>
 
 #ifdef CONFIG_BLOCK
 
@@ -89,6 +90,7 @@  struct disk_stats {
 	unsigned long merges[NR_STAT_GROUPS];
 	unsigned long io_ticks;
 	unsigned long time_in_queue;
+	local_t in_flight[2];
 };
 
 #define PARTITION_META_INFO_VOLNAMELTH	64
@@ -122,7 +124,6 @@  struct hd_struct {
 	int make_it_fail;
 #endif
 	unsigned long stamp;
-	atomic_t in_flight[2];
 #ifdef	CONFIG_SMP
 	struct disk_stats __percpu *dkstats;
 #else
@@ -380,9 +381,9 @@  void part_in_flight(struct request_queue *q, struct hd_struct *part,
 		    unsigned int inflight[2]);
 void part_in_flight_rw(struct request_queue *q, struct hd_struct *part,
 		       unsigned int inflight[2]);
-void part_dec_in_flight(struct request_queue *q, struct hd_struct *part,
+void part_dec_in_flight(struct request_queue *q, int cpu, struct hd_struct *part,
 			int rw);
-void part_inc_in_flight(struct request_queue *q, struct hd_struct *part,
+void part_inc_in_flight(struct request_queue *q, int cpu, struct hd_struct *part,
 			int rw);
 
 static inline struct partition_meta_info *alloc_part_info(struct gendisk *disk)