| Message ID | 20181106213858.391264280@debian-a64.vm (mailing list archive) |
|---|---|
| State | Superseded, archived |
| Delegated to | Mike Snitzer |
| Series | device mapper percpu patches |
On Tue, Nov 06, 2018 at 10:35:03PM +0100, Mikulas Patocka wrote:
> Device mapper was converted to percpu inflight counters. In order to
> display the correct values in the "inflight" sysfs file, we need a custom
> callback that sums the percpu counters.

The attribute that calls this is per-partition, while your method
is per-queue, so there is some impedance mismatch here.

Is there any way you could look into just making the generic code
use percpu counters?

Also please cc linux-block on patches that touch the generic block
code.
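[Editorial note: what Christoph is suggesting would look roughly like the sketch below, where the generic block layer keeps a per-CPU counter pair per partition, touches only the local CPU's counters on the hot path, and sums across CPUs only when the sysfs attribute is read. This is a hypothetical illustration, not code from this thread; the `pcpu_inflight` field and the helper names are invented for the example.]

```c
/* Hypothetical sketch of generic percpu inflight accounting.
 * "pcpu_inflight" is an invented field on struct hd_struct. */
struct part_inflight {
	long cnt[2];			/* signed READ/WRITE deltas, one per CPU */
};

/* hot path: update only this CPU's counter, no shared cacheline */
static inline void part_inc_in_flight_percpu(struct hd_struct *part, int rw)
{
	this_cpu_inc(part->pcpu_inflight->cnt[rw]);
}

/* slow path (sysfs/diskstats read): sum the deltas across all CPUs */
static void part_in_flight_rw_percpu(struct hd_struct *part,
				     unsigned int inflight[2])
{
	long r = 0, w = 0;
	int cpu;

	for_each_possible_cpu(cpu) {
		struct part_inflight *p = per_cpu_ptr(part->pcpu_inflight, cpu);
		r += p->cnt[READ];
		w += p->cnt[WRITE];
	}
	/* individual CPU deltas can be negative; only the sum is meaningful */
	inflight[READ]  = r > 0 ? r : 0;
	inflight[WRITE] = w > 0 ? w : 0;
}
```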
On Thu, Nov 08 2018 at 9:52am -0500,
Christoph Hellwig <hch@infradead.org> wrote:

> On Tue, Nov 06, 2018 at 10:35:03PM +0100, Mikulas Patocka wrote:
> > Device mapper was converted to percpu inflight counters. In order to
> > display the correct values in the "inflight" sysfs file, we need a custom
> > callback that sums the percpu counters.
>
> The attribute that calls this is per-partition, while your method
> is per-queue, so there is some impedance mismatch here.
>
> Is there any way you could look into just making the generic code
> use percpu counters?

Discussed doing that with Jens and reported as much here:

https://www.redhat.com/archives/dm-devel/2018-November/msg00068.html

And Jens gave additional context for why yet another attempt to switch
block core's in_flight to percpu counters is doomed (having already been
proposed and rejected twice):

https://www.redhat.com/archives/dm-devel/2018-November/msg00071.html

And yes, definitely should've cc'd linux-block (now added).

Mike
On Thu, Nov 08, 2018 at 12:07:01PM -0500, Mike Snitzer wrote:
> Discussed doing that with Jens and reported as much here:
>
> https://www.redhat.com/archives/dm-devel/2018-November/msg00068.html
>
> And Jens gave additional context for why yet another attempt to switch
> block core's in_flight to percpu counters is doomed (having already been
> proposed and rejected twice):
>
> https://www.redhat.com/archives/dm-devel/2018-November/msg00071.html
>
> And yes, definitely should've cc'd linux-block (now added).

So how is dm different from the other handful of drivers using the
make_request interface, such that the per-cpu counters work for dm and
not for the others?
On Wed, Nov 14 2018 at 10:18am -0500,
Christoph Hellwig <hch@infradead.org> wrote:

> On Thu, Nov 08, 2018 at 12:07:01PM -0500, Mike Snitzer wrote:
> > Discussed doing that with Jens and reported as much here:
> >
> > https://www.redhat.com/archives/dm-devel/2018-November/msg00068.html
> >
> > And Jens gave additional context for why yet another attempt to switch
> > block core's in_flight to percpu counters is doomed (having already been
> > proposed and rejected twice):
> >
> > https://www.redhat.com/archives/dm-devel/2018-November/msg00071.html
> >
> > And yes, definitely should've cc'd linux-block (now added).
>
> So how is dm different from the other handful of drivers using the
> make_request interface, such that the per-cpu counters work for dm and
> not for the others?

I think a big part of the historic reluctance to switch to percpu
in_flight counters was that, until now (4.21), the legacy request path
was also using the in_flight counters. Now that they are only used by
bio-based (make_request) drivers, we likely have more latitude
(hopefully?).

Though I cannot say for sure why they performed so well in Mikulas'
testing: you'd think all the percpu summing on every jiffy during IO
completion would still have been costly, but Mikulas saw great results.

Mikulas and I have discussed a new way forward and he is actively
working through implementing it. Basically he'll still switch to percpu
in_flight counters, but he'll change the algorithm for IO accounting
during completion so that it is more of an approximation of the
historically precise in_flight counters and io_ticks (io_ticks is
another problematic component that gets in the way of performance).
The accounting done during IO completion would be much, much faster. A
big part of this is that the summation of the percpu in_flight counters
would happen on demand (via sysfs or /proc/diskstats access).

I could look back at my logs from my chat with Mikulas to give you more
details, or we could just wait for Mikulas to post the patches
(hopefully within a week). Your call.

Coming off my Monday discussion with Mikulas, I really think the
approach will work nicely and offer a nice performance win for
bio-based.

Mike
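[Editorial note: to make the "approximation" idea concrete, here is a rough sketch of the direction being described, not code from this thread. The idea is that io_ticks stops being an exact busy-time measurement and is instead bumped by one tick the first time any IO activity is seen in a new jiffy, so completion-time accounting shrinks to a compare plus an occasional cmpxchg. The `stamp` and `io_ticks` fields and the function name are illustrative assumptions.]

```c
/*
 * Illustrative sketch only: approximate io_ticks accounting.
 * Instead of precisely tracking busy time at every IO completion,
 * charge one tick the first time activity is seen in a new jiffy.
 * "stamp" and "io_ticks" are hypothetical per-device fields here.
 */
static void update_io_ticks_approx(struct hd_struct *part, unsigned long now)
{
	unsigned long stamp = READ_ONCE(part->stamp);

	if (unlikely(stamp != now)) {
		/* only one CPU per jiffy wins the cmpxchg and adds the tick */
		if (cmpxchg(&part->stamp, stamp, now) == stamp)
			part->io_ticks++;
	}
}
```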
On Thu, 8 Nov 2018, Christoph Hellwig wrote:

> On Tue, Nov 06, 2018 at 10:35:03PM +0100, Mikulas Patocka wrote:
> > Device mapper was converted to percpu inflight counters. In order to
> > display the correct values in the "inflight" sysfs file, we need a custom
> > callback that sums the percpu counters.
>
> The attribute that calls this is per-partition, while your method
> is per-queue, so there is some impedance mismatch here.

Device mapper doesn't use partitions.

> Is there any way you could look into just making the generic code
> use percpu counters?

In the next merge window, single-queue request-based block drivers will
be eliminated - all the drivers will be multiqueue. So, they won't use
the in_flight variables at all. in_flight will only be used by
bio-based stacking drivers like md.

> Also please cc linux-block on patches that touch the generic block
> code.

Mikulas
On Wed, 14 Nov 2018, Christoph Hellwig wrote:

> On Thu, Nov 08, 2018 at 12:07:01PM -0500, Mike Snitzer wrote:
> > Discussed doing that with Jens and reported as much here:
> >
> > https://www.redhat.com/archives/dm-devel/2018-November/msg00068.html
> >
> > And Jens gave additional context for why yet another attempt to switch
> > block core's in_flight to percpu counters is doomed (having already been
> > proposed and rejected twice):
> >
> > https://www.redhat.com/archives/dm-devel/2018-November/msg00071.html
> >
> > And yes, definitely should've cc'd linux-block (now added).
>
> So how is dm different from the other handful of drivers using the
> make_request interface, such that the per-cpu counters work for dm and
> not for the others?

We want to make dm-linear (and dm-stripe) completely lockless, because
they are used often and we don't want them to degrade performance. DM
already uses srcu to handle table changes, so the fast path doesn't
take any locks. The only remaining "lock" is the "in_flight" variable.

As for other drivers, md-raid0 could probably be lockless too (using
percpu counting similar to dm). The other raid levels can't be
lockless because they need to check the status of the stripe that is
being accessed.

Mikulas
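[Editorial note: as a rough illustration of the fast-path shape Mikulas describes - a simplified sketch, not the actual drivers/md/dm.c code; bio splitting, error handling, and return-value plumbing are elided. srcu protects the table lookup, and the percpu counter bump is the only remaining cross-CPU state change. The `md->counters` field matches the patch below.]

```c
/*
 * Simplified sketch of the lockless bio submission path described
 * above (not the actual drivers/md/dm.c code).
 */
static blk_qc_t dm_make_request_sketch(struct request_queue *q, struct bio *bio)
{
	struct mapped_device *md = q->queuedata;
	struct dm_table *map;
	int srcu_idx;

	/* srcu read side: table swaps wait for readers, readers never block */
	map = dm_get_live_table(md, &srcu_idx);

	/* percpu accounting: touches only this CPU's cacheline */
	this_cpu_inc(md->counters->inflight[bio_data_dir(bio)]);

	__split_and_process_bio(md, map, bio);	/* map and submit */

	dm_put_live_table(md, srcu_idx);
	return BLK_QC_T_NONE;
}
```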
```diff
Index: linux-2.6/block/genhd.c
===================================================================
--- linux-2.6.orig/block/genhd.c	2018-11-06 22:31:46.350000000 +0100
+++ linux-2.6/block/genhd.c	2018-11-06 22:31:46.350000000 +0100
@@ -85,6 +85,11 @@ void part_in_flight(struct request_queue
 void part_in_flight_rw(struct request_queue *q, struct hd_struct *part,
 		       unsigned int inflight[2])
 {
+	if (q->get_inflight_fn) {
+		q->get_inflight_fn(q, inflight);
+		return;
+	}
+
 	if (q->mq_ops) {
 		blk_mq_in_flight_rw(q, part, inflight);
 		return;
Index: linux-2.6/include/linux/blkdev.h
===================================================================
--- linux-2.6.orig/include/linux/blkdev.h	2018-11-06 22:31:46.350000000 +0100
+++ linux-2.6/include/linux/blkdev.h	2018-11-06 22:31:46.350000000 +0100
@@ -324,6 +324,7 @@ typedef int (lld_busy_fn) (struct reques
 typedef int (bsg_job_fn) (struct bsg_job *);
 typedef int (init_rq_fn)(struct request_queue *, struct request *, gfp_t);
 typedef void (exit_rq_fn)(struct request_queue *, struct request *);
+typedef void (get_inflight_fn)(struct request_queue *, unsigned int [2]);
 
 enum blk_eh_timer_return {
 	BLK_EH_DONE,		/* drivers has completed the command */
@@ -466,6 +467,8 @@ struct request_queue {
 	exit_rq_fn		*exit_rq_fn;
 
 	/* Called from inside blk_get_request() */
 	void (*initialize_rq_fn)(struct request *rq);
+	/* Called to get the "inflight" values */
+	get_inflight_fn		*get_inflight_fn;
 
 	const struct blk_mq_ops	*mq_ops;
@@ -1232,6 +1235,7 @@ extern int blk_queue_dma_drain(struct re
 			       dma_drain_needed_fn *dma_drain_needed,
 			       void *buf, unsigned int size);
 extern void blk_queue_lld_busy(struct request_queue *q, lld_busy_fn *fn);
+extern void blk_queue_get_inflight(struct request_queue *q, get_inflight_fn *fn);
 extern void blk_queue_segment_boundary(struct request_queue *, unsigned long);
 extern void blk_queue_virt_boundary(struct request_queue *, unsigned long);
 extern void blk_queue_prep_rq(struct request_queue *, prep_rq_fn *pfn);
Index: linux-2.6/block/blk-settings.c
===================================================================
--- linux-2.6.orig/block/blk-settings.c	2018-11-06 22:31:46.350000000 +0100
+++ linux-2.6/block/blk-settings.c	2018-11-06 22:31:46.350000000 +0100
@@ -79,6 +79,12 @@ void blk_queue_lld_busy(struct request_q
 }
 EXPORT_SYMBOL_GPL(blk_queue_lld_busy);
 
+void blk_queue_get_inflight(struct request_queue *q, get_inflight_fn *fn)
+{
+	q->get_inflight_fn = fn;
+}
+EXPORT_SYMBOL_GPL(blk_queue_get_inflight);
+
 /**
  * blk_set_default_limits - reset limits to default values
  * @lim:  the queue_limits structure to reset
Index: linux-2.6/drivers/md/dm.c
===================================================================
--- linux-2.6.orig/drivers/md/dm.c	2018-11-06 22:31:46.350000000 +0100
+++ linux-2.6/drivers/md/dm.c	2018-11-06 22:31:46.350000000 +0100
@@ -662,6 +662,23 @@ static void end_io_acct(struct dm_io *io
 	}
 }
 
+static void dm_get_inflight(struct request_queue *q, unsigned int inflight[2])
+{
+	struct mapped_device *md = q->queuedata;
+	int cpu;
+
+	inflight[READ] = inflight[WRITE] = 0;
+	for_each_possible_cpu(cpu) {
+		struct dm_percpu *p = per_cpu_ptr(md->counters, cpu);
+		inflight[READ] += p->inflight[READ];
+		inflight[WRITE] += p->inflight[WRITE];
+	}
+	if ((int)inflight[READ] < 0)
+		inflight[READ] = 0;
+	if ((int)inflight[WRITE] < 0)
+		inflight[WRITE] = 0;
+}
+
 /*
  * Add the bio to the list of deferred io.
  */
@@ -2242,6 +2259,7 @@ int dm_setup_md_queue(struct mapped_devi
 	case DM_TYPE_NVME_BIO_BASED:
 		dm_init_normal_md_queue(md);
 		blk_queue_make_request(md->queue, dm_make_request);
+		blk_queue_get_inflight(md->queue, dm_get_inflight);
 		break;
 	case DM_TYPE_NONE:
 		WARN_ON_ONCE(true);
```
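[Editorial note: the `(int)` clamps in dm_get_inflight() deserve a word. A bio can be submitted on one CPU and completed on another, so an individual CPU's counter is really a signed delta that may be negative; only the sum across all CPUs is meaningful, and a transiently negative sum is clamped to zero. A minimal sketch of the two halves of the accounting, with hypothetical helper names - in the real patch series the increments live in dm's start/end IO accounting:]

```c
/*
 * Hypothetical helpers illustrating why one CPU's counter can go
 * negative: submission and completion may run on different CPUs,
 * so each CPU holds only a signed delta.
 */
static void dm_inflight_start(struct mapped_device *md, struct bio *bio)
{
	/* runs on the submitting CPU */
	this_cpu_inc(md->counters->inflight[bio_data_dir(bio)]);
}

static void dm_inflight_end(struct mapped_device *md, struct bio *bio)
{
	/* runs on the completing CPU, which may be a different one */
	this_cpu_dec(md->counters->inflight[bio_data_dir(bio)]);
}
```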
Device mapper was converted to percpu inflight counters. In order to
display the correct values in the "inflight" sysfs file, we need a custom
callback that sums the percpu counters.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>

---
```
 block/blk-settings.c   |    6 ++++++
 block/genhd.c          |    5 +++++
 drivers/md/dm.c        |   18 ++++++++++++++++++
 include/linux/blkdev.h |    4 ++++
 4 files changed, 33 insertions(+)
```