
[5/5] block: use a driver-specific handler for the "inflight" value

Message ID 20181106213858.391264280@debian-a64.vm (mailing list archive)
State Superseded, archived
Delegated to: Mike Snitzer
Series device mapper percpu patches

Commit Message

Mikulas Patocka Nov. 6, 2018, 9:35 p.m. UTC
Device mapper was converted to percpu inflight counters. In order to
display the correct values in the "inflight" sysfs file, we need a custom
callback that sums the percpu counters.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>

---
 block/blk-settings.c   |    6 ++++++
 block/genhd.c          |    5 +++++
 drivers/md/dm.c        |   18 ++++++++++++++++++
 include/linux/blkdev.h |    4 ++++
 4 files changed, 33 insertions(+)


--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel

Comments

Christoph Hellwig Nov. 8, 2018, 2:52 p.m. UTC | #1
On Tue, Nov 06, 2018 at 10:35:03PM +0100, Mikulas Patocka wrote:
> Device mapper was converted to percpu inflight counters. In order to
> display the correct values in the "inflight" sysfs file, we need a custom
> callback that sums the percpu counters.

The attribute that calls this is per-partition, while your method
is per-queue, so there is some impedance mismatch here.

Is there any way you could look into just making the generic code
use percpu counters?

Also please cc linux-block on patches that touch the generic block
code.

Mike Snitzer Nov. 8, 2018, 5:07 p.m. UTC | #2
On Thu, Nov 08 2018 at  9:52am -0500,
Christoph Hellwig <hch@infradead.org> wrote:

> On Tue, Nov 06, 2018 at 10:35:03PM +0100, Mikulas Patocka wrote:
> > Device mapper was converted to percpu inflight counters. In order to
> > display the correct values in the "inflight" sysfs file, we need a custom
> > callback that sums the percpu counters.
> 
> The attribute that calls this is per-partition, while your method
> is per-queue, so there is some impedance mismatch here.
> 
> Is there any way you could look into just making the generic code
> use percpu counters?

Discussed doing that with Jens and reported as much here:

https://www.redhat.com/archives/dm-devel/2018-November/msg00068.html

And Jens gave additional context for why yet another attempt to switch
block core's in_flight to percpu counters is doomed (having already been
proposed and rejected twice):

https://www.redhat.com/archives/dm-devel/2018-November/msg00071.html

And yes, definitely should've cc'd linux-block (now added).

Mike

Christoph Hellwig Nov. 14, 2018, 3:18 p.m. UTC | #3
On Thu, Nov 08, 2018 at 12:07:01PM -0500, Mike Snitzer wrote:
> Discussed doing that with Jens and reported as much here:
> 
> https://www.redhat.com/archives/dm-devel/2018-November/msg00068.html
> 
> And Jens gave additional context for why yet another attempt to switch
> block core's in_flight to percpu counters is doomed (having already been
> proposed and rejected twice):
> 
> https://www.redhat.com/archives/dm-devel/2018-November/msg00071.html
> 
> And yes, definitely should've cc'd linux-block (now added).

So how is dm different from the other handful of drivers using
the make_request interface, such that the per-cpu counters work for dm and
not the others?

Mike Snitzer Nov. 14, 2018, 3:34 p.m. UTC | #4
On Wed, Nov 14 2018 at 10:18am -0500,
Christoph Hellwig <hch@infradead.org> wrote:

> On Thu, Nov 08, 2018 at 12:07:01PM -0500, Mike Snitzer wrote:
> > Discussed doing that with Jens and reported as much here:
> > 
> > https://www.redhat.com/archives/dm-devel/2018-November/msg00068.html
> > 
> > And Jens gave additional context for why yet another attempt to switch
> > block core's in_flight to percpu counters is doomed (having already been
> > proposed and rejected twice):
> > 
> > https://www.redhat.com/archives/dm-devel/2018-November/msg00071.html
> > 
> > And yes, definitely should've cc'd linux-block (now added).
> 
> So how is dm different from the other handful of drivers using
> the make_request interface, such that the per-cpu counters work for dm and
> not the others?

I think the big part of the historic reluctance to switch to percpu
in_flight counters was that until now (4.21) the legacy request path was
also using the in_flight counters.

Now that they are only used by bio-based (make_request) we likely have
more latitude (hopefully?).  Though I cannot say for sure why they
performed so well in Mikulas' testing... you'd think all the percpu
summing on every jiffy during IO completion would've still been
costly... but Mikulas saw great results.

Mikulas and I have discussed a new way forward and he is actively
working through implementing it.  Basically he'll still switch to percpu
in_flight counters but he'll change the algorithm for IO accounting
during completion so that it is more of an approximation of the
historically precise in_flight counters and io_ticks (io_ticks is
another problematic component that gets in the way of performance).
Basically, the accounting done during IO completion would be much
faster.  A big part of this is that the summation of the percpu in_flight
counters would happen on demand (via sysfs or /proc/diskstats access).
I could look back at my logs from my chat with Mikulas to give you more
details or we could just wait for Mikulas to post the patches (hopefully
within a week).  Your call.

Coming off my Monday discussion with Mikulas I really think the approach
will work nicely and offer a nice performance win for bio-based.

Mike

Mikulas Patocka Nov. 14, 2018, 10:35 p.m. UTC | #5
On Thu, 8 Nov 2018, Christoph Hellwig wrote:

> On Tue, Nov 06, 2018 at 10:35:03PM +0100, Mikulas Patocka wrote:
> > Device mapper was converted to percpu inflight counters. In order to
> > display the correct values in the "inflight" sysfs file, we need a custom
> > callback that sums the percpu counters.
> 
> The attribute that calls this is per-partition, while your method
> is per-queue, so there is some impedance mismatch here.

Device mapper doesn't use partitions.

> Is there any way you could look into just making the generic code
> use percpu counters?

In the next merge window, single-queue request-based block drivers will be
eliminated; all request-based drivers will be multiqueue, so they won't use
the in_flight variables at all.

in_flight will only be used by bio-based stacking drivers like md.

> Also please cc linux-block on patches that touch the generic block
> code.

Mikulas

Mikulas Patocka Nov. 14, 2018, 10:49 p.m. UTC | #6
On Wed, 14 Nov 2018, Christoph Hellwig wrote:

> On Thu, Nov 08, 2018 at 12:07:01PM -0500, Mike Snitzer wrote:
> > Discussed doing that with Jens and reported as much here:
> > 
> > https://www.redhat.com/archives/dm-devel/2018-November/msg00068.html
> > 
> > And Jens gave additional context for why yet another attempt to switch
> > block core's in_flight to percpu counters is doomed (having already been
> > proposed and rejected twice):
> > 
> > https://www.redhat.com/archives/dm-devel/2018-November/msg00071.html
> > 
> > And yes, definitely should've cc'd linux-block (now added).
> 
> So how is dm different from the other handful of drivers using
> the make_request interface, such that the per-cpu counters work for dm and
> not the others?

We want to make dm-linear (and dm-stripe) completely lockless, because it 
is used often and we don't want it to degrade performance.

DM already uses srcu to handle table changes, so that the fast path
doesn't take any locks. The only remaining "lock" is the
"in_flight" variable.

As for other drivers, md-raid0 could probably be lockless too (using 
percpu counting similar to dm). The other raid levels can't be lockless 
because they need to check the status of the stripe that is being 
accessed.

Mikulas


Patch

Index: linux-2.6/block/genhd.c
===================================================================
--- linux-2.6.orig/block/genhd.c	2018-11-06 22:31:46.350000000 +0100
+++ linux-2.6/block/genhd.c	2018-11-06 22:31:46.350000000 +0100
@@ -85,6 +85,11 @@  void part_in_flight(struct request_queue
 void part_in_flight_rw(struct request_queue *q, struct hd_struct *part,
 		       unsigned int inflight[2])
 {
+	if (q->get_inflight_fn) {
+		q->get_inflight_fn(q, inflight);
+		return;
+	}
+
 	if (q->mq_ops) {
 		blk_mq_in_flight_rw(q, part, inflight);
 		return;
Index: linux-2.6/include/linux/blkdev.h
===================================================================
--- linux-2.6.orig/include/linux/blkdev.h	2018-11-06 22:31:46.350000000 +0100
+++ linux-2.6/include/linux/blkdev.h	2018-11-06 22:31:46.350000000 +0100
@@ -324,6 +324,7 @@  typedef int (lld_busy_fn) (struct reques
 typedef int (bsg_job_fn) (struct bsg_job *);
 typedef int (init_rq_fn)(struct request_queue *, struct request *, gfp_t);
 typedef void (exit_rq_fn)(struct request_queue *, struct request *);
+typedef void (get_inflight_fn)(struct request_queue *, unsigned int [2]);
 
 enum blk_eh_timer_return {
 	BLK_EH_DONE,		/* drivers has completed the command */
@@ -466,6 +467,8 @@  struct request_queue {
 	exit_rq_fn		*exit_rq_fn;
 	/* Called from inside blk_get_request() */
 	void (*initialize_rq_fn)(struct request *rq);
+	/* Called to get the "inflight" values */
+	get_inflight_fn		*get_inflight_fn;
 
 	const struct blk_mq_ops	*mq_ops;
 
@@ -1232,6 +1235,7 @@  extern int blk_queue_dma_drain(struct re
 			       dma_drain_needed_fn *dma_drain_needed,
 			       void *buf, unsigned int size);
 extern void blk_queue_lld_busy(struct request_queue *q, lld_busy_fn *fn);
+extern void blk_queue_get_inflight(struct request_queue *q, get_inflight_fn *fn);
 extern void blk_queue_segment_boundary(struct request_queue *, unsigned long);
 extern void blk_queue_virt_boundary(struct request_queue *, unsigned long);
 extern void blk_queue_prep_rq(struct request_queue *, prep_rq_fn *pfn);
Index: linux-2.6/block/blk-settings.c
===================================================================
--- linux-2.6.orig/block/blk-settings.c	2018-11-06 22:31:46.350000000 +0100
+++ linux-2.6/block/blk-settings.c	2018-11-06 22:31:46.350000000 +0100
@@ -79,6 +79,12 @@  void blk_queue_lld_busy(struct request_q
 }
 EXPORT_SYMBOL_GPL(blk_queue_lld_busy);
 
+void blk_queue_get_inflight(struct request_queue *q, get_inflight_fn *fn)
+{
+	q->get_inflight_fn = fn;
+}
+EXPORT_SYMBOL_GPL(blk_queue_get_inflight);
+
 /**
  * blk_set_default_limits - reset limits to default values
  * @lim:  the queue_limits structure to reset
Index: linux-2.6/drivers/md/dm.c
===================================================================
--- linux-2.6.orig/drivers/md/dm.c	2018-11-06 22:31:46.350000000 +0100
+++ linux-2.6/drivers/md/dm.c	2018-11-06 22:31:46.350000000 +0100
@@ -662,6 +662,23 @@  static void end_io_acct(struct dm_io *io
 	}
 }
 
+static void dm_get_inflight(struct request_queue *q, unsigned int inflight[2])
+{
+	struct mapped_device *md = q->queuedata;
+	int cpu;
+
+	inflight[READ] = inflight[WRITE] = 0;
+	for_each_possible_cpu(cpu) {
+		struct dm_percpu *p = per_cpu_ptr(md->counters, cpu);
+		inflight[READ] += p->inflight[READ];
+		inflight[WRITE] += p->inflight[WRITE];
+	}
+	if ((int)inflight[READ] < 0)
+		inflight[READ] = 0;
+	if ((int)inflight[WRITE] < 0)
+		inflight[WRITE] = 0;
+}
+
 /*
  * Add the bio to the list of deferred io.
  */
@@ -2242,6 +2259,7 @@  int dm_setup_md_queue(struct mapped_devi
 	case DM_TYPE_NVME_BIO_BASED:
 		dm_init_normal_md_queue(md);
 		blk_queue_make_request(md->queue, dm_make_request);
+		blk_queue_get_inflight(md->queue, dm_get_inflight);
 		break;
 	case DM_TYPE_NONE:
 		WARN_ON_ONCE(true);