Message ID | 20221221040506.1174644-2-gulam.mohamed@oracle.com
---|---
State | New, archived
Series | [for-6.2/block,V3,1/2] block: Data type conversion for IO accounting
On Wed, Dec 21, 2022 at 04:05:06AM +0000, Gulam Mohamed wrote:
> +u64 blk_get_iostat_ticks(struct request_queue *q)
> +{
> +	return (blk_queue_precise_io_stat(q) ? ktime_get_ns() : jiffies);
> +}
> +EXPORT_SYMBOL_GPL(blk_get_iostat_ticks);
> +
>  void update_io_ticks(struct block_device *part, u64 now, bool end)
>  {
>  	u64 stamp;
> @@ -968,20 +980,26 @@ EXPORT_SYMBOL(bdev_start_io_acct);
>  u64 bio_start_io_acct(struct bio *bio)
>  {
>  	return bdev_start_io_acct(bio->bi_bdev, bio_sectors(bio),
> -				  bio_op(bio), jiffies);
> +				  bio_op(bio),
> +				  blk_get_iostat_ticks(bio->bi_bdev->bd_queue));
>  }
>  EXPORT_SYMBOL_GPL(bio_start_io_acct);
>
>  void bdev_end_io_acct(struct block_device *bdev, enum req_op op,
>  		      u64 start_time)
>  {
> +	u64 now;
> +	u64 duration;
> +	struct request_queue *q = bdev_get_queue(bdev);
>  	const int sgrp = op_stat_group(op);
> -	u64 now = READ_ONCE(jiffies);
> -	u64 duration = (unsigned long)now -(unsigned long) start_time;
> +	now = blk_get_iostat_ticks(q);;

I don't think you can rely on the blk_queue_precise_io_stat() flag in the
completion side. The user can toggle this with data in flight, which means
the completion may use different tick units than the start. I think you'll
need to add a flag to the request in the start accounting side to know
which method to use for the completion.

> @@ -951,6 +951,7 @@ ssize_t part_stat_show(struct device *dev,
>  	struct request_queue *q = bdev_get_queue(bdev);
>  	struct disk_stats stat;
>  	unsigned int inflight;
> +	u64 stat_ioticks;
>
>  	if (queue_is_mq(q))
>  		inflight = blk_mq_in_flight(q, bdev);
> @@ -959,10 +960,13 @@ ssize_t part_stat_show(struct device *dev,
>
>  	if (inflight) {
>  		part_stat_lock();
> -		update_io_ticks(bdev, jiffies, true);
> +		update_io_ticks(bdev, blk_get_iostat_ticks(q), true);
>  		part_stat_unlock();
>  	}
>  	part_stat_read_all(bdev, &stat);
> +	stat_ioticks = (blk_queue_precise_io_stat(q) ?
> +			div_u64(stat.io_ticks, NSEC_PER_MSEC) :
> +			jiffies_to_msecs(stat.io_ticks));

With the user able to change the precision at will, I think these io_ticks
need to have a consistent unit size. Mixing jiffies and nsecs is going to
create garbage stats output. Could existing io_ticks using jiffies be
converted to jiffies_to_nsecs() so that you only have one unit to consider?
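(For illustration only, not part of the posted patch: one way to keep the
start and completion sides in the same unit is to latch the tick source on
the start side. RQF_PRECISE_IOSTAT and io_start_ticks below are hypothetical
names for a request flag and field that would have to be added.)

	/* start-side accounting: record which clock was used */
	static void blk_account_start_ticks(struct request *rq)
	{
		if (blk_queue_precise_io_stat(rq->q)) {
			rq->rq_flags |= RQF_PRECISE_IOSTAT;	/* hypothetical flag */
			rq->io_start_ticks = ktime_get_ns();	/* hypothetical field */
		} else {
			rq->io_start_ticks = jiffies;
		}
	}

	/*
	 * completion side: use the unit recorded at start, ignoring any
	 * later toggles of the sysfs setting
	 */
	static u64 blk_account_end_ticks(struct request *rq)
	{
		return (rq->rq_flags & RQF_PRECISE_IOSTAT) ?
			ktime_get_ns() : (u64)jiffies;
	}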
Hi Keith,

Thanks for reviewing this request. Can you please see my inline comments?

Regards,
Gulam Mohamed.

-----Original Message-----
From: Keith Busch <kbusch@kernel.org>
Sent: Wednesday, December 21, 2022 10:39 PM
To: Gulam Mohamed <gulam.mohamed@oracle.com>
Cc: linux-block@vger.kernel.org; axboe@kernel.dk; philipp.reisner@linbit.com;
    lars.ellenberg@linbit.com; christoph.boehmwalder@linbit.com;
    minchan@kernel.org; ngupta@vflare.org; senozhatsky@chromium.org;
    colyli@suse.de; kent.overstreet@gmail.com; agk@redhat.com;
    snitzer@kernel.org; dm-devel@redhat.com; song@kernel.org;
    dan.j.williams@intel.com; vishal.l.verma@intel.com; dave.jiang@intel.com;
    ira.weiny@intel.com; Junxiao Bi <junxiao.bi@oracle.com>;
    Martin Petersen <martin.petersen@oracle.com>; kch@nvidia.com;
    drbd-dev@lists.linbit.com; linux-kernel@vger.kernel.org;
    linux-bcache@vger.kernel.org; linux-raid@vger.kernel.org;
    nvdimm@lists.linux.dev; Konrad Wilk <konrad.wilk@oracle.com>;
    Joe Jin <joe.jin@oracle.com>;
    Rajesh Sivaramasubramaniom <rajesh.sivaramasubramaniom@oracle.com>;
    Shminderjit Singh <shminderjit.singh@oracle.com>
Subject: Re: [PATCH for-6.2/block V3 2/2] block: Change the granularity of io ticks from ms to ns

On Wed, Dec 21, 2022 at 04:05:06AM +0000, Gulam Mohamed wrote:
> +u64 blk_get_iostat_ticks(struct request_queue *q)
> +{
> +	return (blk_queue_precise_io_stat(q) ? ktime_get_ns() : jiffies);
> +}
> +EXPORT_SYMBOL_GPL(blk_get_iostat_ticks);
> +
>  void update_io_ticks(struct block_device *part, u64 now, bool end)
>  {
>  	u64 stamp;
> @@ -968,20 +980,26 @@ EXPORT_SYMBOL(bdev_start_io_acct);
>  u64 bio_start_io_acct(struct bio *bio)
>  {
>  	return bdev_start_io_acct(bio->bi_bdev, bio_sectors(bio),
> -				  bio_op(bio), jiffies);
> +				  bio_op(bio),
> +				  blk_get_iostat_ticks(bio->bi_bdev->bd_queue));
>  }
>  EXPORT_SYMBOL_GPL(bio_start_io_acct);
>
>  void bdev_end_io_acct(struct block_device *bdev, enum req_op op,
>  		      u64 start_time)
>  {
> +	u64 now;
> +	u64 duration;
> +	struct request_queue *q = bdev_get_queue(bdev);
>  	const int sgrp = op_stat_group(op);
> -	u64 now = READ_ONCE(jiffies);
> -	u64 duration = (unsigned long)now -(unsigned long) start_time;
> +	now = blk_get_iostat_ticks(q);;

I don't think you can rely on the blk_queue_precise_io_stat() flag in the
completion side. The user can toggle this with data in flight, which means
the completion may use different tick units than the start. I think you'll
need to add a flag to the request in the start accounting side to know
which method to use for the completion.

[GULAM]: As per my understanding, this may work for a single request_queue
implementation. But this request-based accounting may be an issue with
multi-queue, as there is a separate queue for each CPU while the timestamp
being recorded for the block device is a global one. Also, the issue you
mentioned, where start and end accounting could update the ticks in
different units for the in-flight IOs, would only last for a moment. So,
even if it works for MQ, I am trying to understand how feasible it is to
make this request-based change for an issue that may exist only briefly.
Can you please correct me if I am wrong, and expand on your suggestion so
that I can understand it properly?
> @@ -951,6 +951,7 @@ ssize_t part_stat_show(struct device *dev,
>  	struct request_queue *q = bdev_get_queue(bdev);
>  	struct disk_stats stat;
>  	unsigned int inflight;
> +	u64 stat_ioticks;
>
>  	if (queue_is_mq(q))
>  		inflight = blk_mq_in_flight(q, bdev);
> @@ -959,10 +960,13 @@ ssize_t part_stat_show(struct device *dev,
>
>  	if (inflight) {
>  		part_stat_lock();
> -		update_io_ticks(bdev, jiffies, true);
> +		update_io_ticks(bdev, blk_get_iostat_ticks(q), true);
>  		part_stat_unlock();
>  	}
>  	part_stat_read_all(bdev, &stat);
> +	stat_ioticks = (blk_queue_precise_io_stat(q) ?
> +			div_u64(stat.io_ticks, NSEC_PER_MSEC) :
> +			jiffies_to_msecs(stat.io_ticks));

With the user able to change the precision at will, I think these io_ticks
need to have a consistent unit size. Mixing jiffies and nsecs is going to
create garbage stats output. Could existing io_ticks using jiffies be
converted to jiffies_to_nsecs() so that you only have one unit to consider?

[GULAM]: I am not sure if this will work, as we would just be multiplying
by 1,000,000 to convert the jiffies to nanoseconds.
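(For illustration only, not from the posted patch: Keith's suggestion
amounts to always feeding update_io_ticks() nanoseconds, so the stored unit
never changes even if the precise flag is toggled with IO in flight. With
HZ=1000, jiffies_to_nsecs(1) is exactly 1,000,000, so the jiffies path keeps
its coarse granularity but shares the unit. A rough sketch, with a
hypothetical helper name:)

	static inline u64 blk_iostat_now_ns(struct request_queue *q)
	{
		/* precise mode: full clock read; coarse mode: jiffies scaled to ns */
		if (blk_queue_precise_io_stat(q))
			return ktime_get_ns();
		return jiffies_to_nsecs(jiffies);
	}

Readers such as part_stat_show() would then only ever need
div_u64(stat.io_ticks, NSEC_PER_MSEC), with no per-queue branching on the
flag.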
[comments/questions inlined below]

On Tue, Dec 20 2022 at 11:05P -0500,
Gulam Mohamed <gulam.mohamed@oracle.com> wrote:

> Problem Desc
> ============
> The "iostat" user-space utility was showing %util as 100% for the disks
> which has latencies less than a milli-second i.e for latencies in the
> range of micro-seconds and below.
>
> Root Cause
> ==========
> The IO accounting in block layer is currently done by updating the
> io_ticks in jiffies which is of milli-seconds granularity. Due to this,
> for the devices with IO latencies less than a milli-second, the latency
> will be accounted as 1 milli-second even-though its in the range of
> micro-seconds. This was causing the iostat command to show %util
> as 100% which is incorrect.
>
> Recreationg of the issue
> ========================
> Setup
> -----
> Devices: NVMe 24 devices
> Model number: 4610 (Intel)
>
> fio
> ---
> [global]
> bs=4K
> iodepth=1
> direct=1
> ioengine=libaio
> group_reporting
> time_based
> runtime=100
> thinktime=1ms
> numjobs=1
> name=raw-write
> rw=randrw
> ignore_error=EIO:EIO
> [job1]
> filename=/dev/nvme0n1
>
> iostat o/p
> ----------
> Device   %wrqm  r_await  w_await  aqu-sz  rareq-sz  wareq-sz  svctm  %util
> nvme3n1   0.00     0.05     0.00   75.38      0.50      0.00   0.00 100.00
>
> Device   %wrqm  r_await  w_await  aqu-sz  rareq-sz  wareq-sz  svctm  %util
> nvme3n1   0.00     0.05     0.00   74.74      0.50      0.00   0.00 100.00
>
> Solution
> ========
> Use ktime_get_ns() to update the disk_stats io_ticks so that the io_ticks
> are updated for every io start and end times.
>
> Issues using ktime
> ==================
> Using ktime_get_ns() has a performance overhead as ktime will go ahead
> and reads the clock everytime its called. Following test environment
> was used by Jens Axboe on which t/io_uring was run which has shown a
> high performance drop.
>
> Devices
> -------
> SSDs: P5800X
> No of devices: 24
>
> io_uring config
> ---------------
> Polled IO
> iostats: Enabled
> Reads: Random
> Block Size: 512
> QDepth: 128
> Batch Submit: 32
> Batch Complete: 32
> No of Threads: 24
>
> With the above environment and ktime patch, it has shown a performance
> drop of ~25% from iostats disabled and ~19% performance drop from current
> iostats enabled. This performance drop is high.
>
> Suggestion from Jens Axboe
> ==========================
> Jens Axboe suggested to split the bigger patch into two as follows:
>
> 1. In first patch, change all the types from unsigned long to u64, that
> can be done while retaining jiffies.
>
> 2. In second patch, add an iostats == 2 setting, which enables the higher
> resolution mode using ktime. We'd still default to 1, lower granularity
> iostats enabled.
>
> Fix details
> ===========
> 1. Use ktime_get_ns() to get the current nano-seconds to update the
> io_ticks for start and end time stamps in block layer for io accounting
>
> 2. Create a new setting '2' in sysfs for iostats variable i.e for
> /sys/block/<device-name>/queue/iostats, to enable the iostat (io
> accounting) with nano-seconds (using ktime) granularity. This setting
> should be enabled only if the iostat is needed with high resolution
> mode as it has a high performance drop
>
> 3. Earlier available settings were 0 and 1 for disable and enable io
> accounting with milli-seconds granularity (jiffies)
>
> Testing
> =======
> Ran the t/io_uring command with following setup:
>
> Devices
> -------
> SSDs: P4610
> No of devices: 8
>
> io_uring config
> ---------------
> Polled IO
> iostats: Enabled
> Reads: Random
> Block Size: 512
> QDepth: 128
> Batch Submit: 32
> Batch Complete: 32
> No of Threads: 24
>
> io_uring o/p
> ------------
> iostat=0, with patch: Maximum IOPS=10.09M
> iostat=1, with patch: Maximum IOPS=9.84M
> iostat=2, with patch: Maximum IOPS=9.48M
>
> Changes from V2 to V3
> =====================
> 1. Changed all the required variables data-type to u64 as a first patch
> 2. Create a new setting '2' for iostats in sysfs in this patch
> 3. Change the code to get the ktime values when iostat=2, in this patch
>
> Signed-off-by: Gulam Mohamed <gulam.mohamed@oracle.com>
> ---
>  block/blk-core.c                  | 26 +++++++++++++++++----
>  block/blk-mq.c                    |  4 ++--
>  block/blk-sysfs.c                 | 39 ++++++++++++++++++++++++++++++-
>  block/genhd.c                     | 18 ++++++++++----
>  drivers/block/drbd/drbd_debugfs.c | 12 ++++++++--
>  drivers/block/zram/zram_drv.c     |  3 ++-
>  drivers/md/dm.c                   | 13 +++++++++--
>  include/linux/blkdev.h            |  4 ++++
>  8 files changed, 103 insertions(+), 16 deletions(-)
>
> diff --git a/block/blk-core.c b/block/blk-core.c
> index 5670032fe932..0b5e4eb909a5 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -927,6 +927,18 @@ int iocb_bio_iopoll(struct kiocb *kiocb, struct io_comp_batch *iob,
>  }
>  EXPORT_SYMBOL_GPL(iocb_bio_iopoll);
>
> +/*
> + * Get the time based upon the available granularity for io accounting
> + * If the resolution mode is set to precise (2) i.e
> + * (/sys/block/<device>/queue/iostats = 2), then this will return time
> + * in nano-seconds else this will return time in jiffies
> + */
> +u64 blk_get_iostat_ticks(struct request_queue *q)
> +{
> +	return (blk_queue_precise_io_stat(q) ? ktime_get_ns() : jiffies);
> +}
> +EXPORT_SYMBOL_GPL(blk_get_iostat_ticks);
> +

Would prefer to see blk_get_iostat_ticks tagged with 'static inline' and
moved to blkdev.h (obviously below blk_queue_precise_io_stat).

Also, just a general comment: I've historically been a bigger (ab)user of
jump_labels to avoid excess branching costs per IO, e.g. see commit
442761fd2b29 ("dm: conditionally enable branching for less used features").

It could be there is an opportunity for a follow-on patch that leverages
them? Or should that just be left to each driver to decide to do with its
own resources (rather than have block core provide that)?

(if nothing in the system ever uses QUEUE_FLAG_PRECISE_IO_STAT then it'd
be nice to avoid needless branching).

<snip>

> diff --git a/drivers/md/dm.c b/drivers/md/dm.c
> index 011a85ea40da..1bb58a0b8cd1 100644
> --- a/drivers/md/dm.c
> +++ b/drivers/md/dm.c
> @@ -482,7 +482,11 @@ static int dm_blk_ioctl(struct block_device *bdev, fmode_t mode,
>
>  u64 dm_start_time_ns_from_clone(struct bio *bio)
>  {
> -	return jiffies_to_nsecs(clone_to_tio(bio)->io->start_time);
> +	struct request_queue *q = bdev_get_queue(bio->bi_bdev);
> +	u64 start_time_ns = (blk_queue_precise_io_stat(q) ?
> +			     clone_to_tio(bio)->io->start_time :
> +			     jiffies_to_nsecs(clone_to_tio(bio)->io->start_time));
> +	return start_time_ns;
>  }
>  EXPORT_SYMBOL_GPL(dm_start_time_ns_from_clone);
>
> @@ -498,6 +502,11 @@ static void dm_io_acct(struct dm_io *io, bool end)
>  	struct mapped_device *md = io->md;
>  	struct bio *bio = io->orig_bio;
>  	unsigned int sectors;
> +	struct request_queue *q = bdev_get_queue(bio->bi_bdev);
> +
> +	u64 start_time = (blk_queue_precise_io_stat(q) ?
> +			  nsecs_to_jiffies(io->start_time) :
> +			  io->start_time);
>
>  	/*
>  	 * If REQ_PREFLUSH set, don't account payload, it will be
> @@ -589,7 +598,7 @@ static struct dm_io *alloc_io(struct mapped_device *md, struct bio *bio)
>  	io->orig_bio = bio;
>  	io->md = md;
>  	spin_lock_init(&io->lock);
> -	io->start_time = jiffies;
> +	io->start_time = blk_get_iostat_ticks(bio->bi_bdev->bd_queue);
>  	io->flags = 0;
>
>  	if (static_branch_unlikely(&stats_enabled))

The above seems correct, by checking the top-level DM device's disposition
on QUEUE_FLAG_PRECISE_IO_STAT, but I'm now wondering if there should be
code that ensures consistency across stacked devices (which DM is the
biggest creator/consumer of)?

dm_table_set_restrictions() deals with such inconsistencies at DM device
creation time (so basically: if any device in the DM table has
QUEUE_FLAG_PRECISE_IO_STAT set then the top-level DM device should inherit
that queue flag). A user could still override it but at least the default
will be sane initially.

Mike
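(For illustration only, not from the posted patch: a rough sketch of the
two suggestions above. The callback name is hypothetical;
dm_table_any_dev_attr() is the existing dm-table.c iterator that similar
flag-inheritance checks already use.)

	/* include/linux/blkdev.h, below blk_queue_precise_io_stat() */
	static inline u64 blk_get_iostat_ticks(struct request_queue *q)
	{
		return blk_queue_precise_io_stat(q) ? ktime_get_ns() : (u64)jiffies;
	}

	/* drivers/md/dm-table.c: does a component device use precise stats? */
	static int device_precise_io_stat(struct dm_target *ti, struct dm_dev *dev,
					  sector_t start, sector_t len, void *data)
	{
		return blk_queue_precise_io_stat(bdev_get_queue(dev->bdev));
	}

	/* in dm_table_set_restrictions(): inherit the flag from any component */
	if (dm_table_any_dev_attr(t, device_precise_io_stat, NULL))
		blk_queue_flag_set(QUEUE_FLAG_PRECISE_IO_STAT, q);

A jump_label follow-on, as suggested above, could hide the per-IO branch
behind a static_branch_unlikely() key that is flipped from the sysfs store
path.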
diff --git a/block/blk-core.c b/block/blk-core.c
index 5670032fe932..0b5e4eb909a5 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -927,6 +927,18 @@ int iocb_bio_iopoll(struct kiocb *kiocb, struct io_comp_batch *iob,
 }
 EXPORT_SYMBOL_GPL(iocb_bio_iopoll);
 
+/*
+ * Get the time based upon the available granularity for io accounting
+ * If the resolution mode is set to precise (2) i.e
+ * (/sys/block/<device>/queue/iostats = 2), then this will return time
+ * in nano-seconds else this will return time in jiffies
+ */
+u64 blk_get_iostat_ticks(struct request_queue *q)
+{
+	return (blk_queue_precise_io_stat(q) ? ktime_get_ns() : jiffies);
+}
+EXPORT_SYMBOL_GPL(blk_get_iostat_ticks);
+
 void update_io_ticks(struct block_device *part, u64 now, bool end)
 {
 	u64 stamp;
@@ -968,20 +980,26 @@ EXPORT_SYMBOL(bdev_start_io_acct);
 u64 bio_start_io_acct(struct bio *bio)
 {
 	return bdev_start_io_acct(bio->bi_bdev, bio_sectors(bio),
-				  bio_op(bio), jiffies);
+				  bio_op(bio),
+				  blk_get_iostat_ticks(bio->bi_bdev->bd_queue));
 }
 EXPORT_SYMBOL_GPL(bio_start_io_acct);
 
 void bdev_end_io_acct(struct block_device *bdev, enum req_op op,
 		      u64 start_time)
 {
+	u64 now;
+	u64 duration;
+	struct request_queue *q = bdev_get_queue(bdev);
 	const int sgrp = op_stat_group(op);
-	u64 now = READ_ONCE(jiffies);
-	u64 duration = (unsigned long)now -(unsigned long) start_time;
+	now = blk_get_iostat_ticks(q);;
+	duration = (unsigned long)now -(unsigned long) start_time;
 
 	part_stat_lock();
 	update_io_ticks(bdev, now, true);
-	part_stat_add(bdev, nsecs[sgrp], jiffies_to_nsecs(duration));
+	part_stat_add(bdev, nsecs[sgrp],
+		      (blk_queue_precise_io_stat(q) ? duration :
+		       jiffies_to_nsecs(duration)));
 	part_stat_local_dec(bdev, in_flight[op_is_write(op)]);
 	part_stat_unlock();
 }
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 4e6b3ccd4989..5318bf87f17e 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -975,7 +975,7 @@ static void __blk_account_io_done(struct request *req, u64 now)
 	const int sgrp = op_stat_group(req_op(req));
 
 	part_stat_lock();
-	update_io_ticks(req->part, jiffies, true);
+	update_io_ticks(req->part, blk_get_iostat_ticks(req->q), true);
 	part_stat_inc(req->part, ios[sgrp]);
 	part_stat_add(req->part, nsecs[sgrp], now - req->start_time_ns);
 	part_stat_unlock();
@@ -1007,7 +1007,7 @@ static void __blk_account_io_start(struct request *rq)
 		rq->part = rq->q->disk->part0;
 
 	part_stat_lock();
-	update_io_ticks(rq->part, jiffies, false);
+	update_io_ticks(rq->part, blk_get_iostat_ticks(rq->q), false);
 	part_stat_unlock();
 }
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index 93d9e9c9a6ea..e7959782ce59 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -307,7 +306,6 @@ queue_##name##_store(struct request_queue *q, const char *page, size_t count) \
 
 QUEUE_SYSFS_BIT_FNS(nonrot, NONROT, 1);
 QUEUE_SYSFS_BIT_FNS(random, ADD_RANDOM, 0);
-QUEUE_SYSFS_BIT_FNS(iostats, IO_STAT, 0);
 QUEUE_SYSFS_BIT_FNS(stable_writes, STABLE_WRITES, 0);
 #undef QUEUE_SYSFS_BIT_FNS
 
@@ -363,6 +362,44 @@ static ssize_t queue_nomerges_store(struct request_queue *q, const char *page,
 	return ret;
 }
 
+static ssize_t queue_iostats_show(struct request_queue *q, char *page)
+{
+	bool iostat = blk_queue_io_stat(q);
+	bool precise_iostat = blk_queue_precise_io_stat(q);
+
+	return queue_var_show(iostat << precise_iostat, page);
+}
+
+static ssize_t queue_iostats_store(struct request_queue *q, const char *page,
+				   size_t count)
+{
+	unsigned long val;
+	ssize_t ret = -EINVAL;
+
+	ret = queue_var_store(&val, page, count);
+
+	if (ret < 0)
+		return ret;
+
+	if (val == 2) {
+		blk_queue_flag_set(QUEUE_FLAG_IO_STAT, q);
+		blk_queue_flag_set(QUEUE_FLAG_PRECISE_IO_STAT, q);
+		q->disk->part0->bd_stamp = 0;
+	}
+	else if (val == 1) {
+		blk_queue_flag_set(QUEUE_FLAG_IO_STAT, q);
+		blk_queue_flag_clear(QUEUE_FLAG_PRECISE_IO_STAT, q);
+		q->disk->part0->bd_stamp = 0;
+	}
+	else if (val == 0) {
+		blk_queue_flag_clear(QUEUE_FLAG_IO_STAT, q);
+		blk_queue_flag_clear(QUEUE_FLAG_PRECISE_IO_STAT, q);
+		q->disk->part0->bd_stamp = 0;
+	}
+
+	return ret;
+}
+
 static ssize_t queue_rq_affinity_show(struct request_queue *q, char *page)
 {
 	bool set = test_bit(QUEUE_FLAG_SAME_COMP, &q->queue_flags);
diff --git a/block/genhd.c b/block/genhd.c
index 03a96d6473e1..d034219a4683 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -951,6 +951,7 @@ ssize_t part_stat_show(struct device *dev,
 	struct request_queue *q = bdev_get_queue(bdev);
 	struct disk_stats stat;
 	unsigned int inflight;
+	u64 stat_ioticks;
 
 	if (queue_is_mq(q))
 		inflight = blk_mq_in_flight(q, bdev);
@@ -959,10 +960,13 @@ ssize_t part_stat_show(struct device *dev,
 
 	if (inflight) {
 		part_stat_lock();
-		update_io_ticks(bdev, jiffies, true);
+		update_io_ticks(bdev, blk_get_iostat_ticks(q), true);
 		part_stat_unlock();
 	}
 	part_stat_read_all(bdev, &stat);
+	stat_ioticks = (blk_queue_precise_io_stat(q) ?
+			div_u64(stat.io_ticks, NSEC_PER_MSEC) :
+			jiffies_to_msecs(stat.io_ticks));
 
 	return sprintf(buf,
 		"%8lu %8lu %8llu %8u "
 		"%8lu %8lu %8llu %8u "
@@ -979,7 +983,7 @@ ssize_t part_stat_show(struct device *dev,
 		(unsigned long long)stat.sectors[STAT_WRITE],
 		(unsigned int)div_u64(stat.nsecs[STAT_WRITE], NSEC_PER_MSEC),
 		inflight,
-		jiffies_to_msecs(stat.io_ticks),
+		(unsigned int)stat_ioticks,
 		(unsigned int)div_u64(stat.nsecs[STAT_READ] +
				stat.nsecs[STAT_WRITE] +
				stat.nsecs[STAT_DISCARD] +
@@ -1217,6 +1221,8 @@ static int diskstats_show(struct seq_file *seqf, void *v)
 	unsigned int inflight;
 	struct disk_stats stat;
 	unsigned long idx;
+	struct request_queue *q;
+	u64 stat_ioticks;
 
 	/*
 	if (&disk_to_dev(gp)->kobj.entry == block_class.devices.next)
@@ -1235,12 +1241,16 @@ static int diskstats_show(struct seq_file *seqf, void *v)
 		else
 			inflight = part_in_flight(hd);
 
+		q = bdev_get_queue(hd);
 		if (inflight) {
 			part_stat_lock();
-			update_io_ticks(hd, jiffies, true);
+			update_io_ticks(hd, blk_get_iostat_ticks(q), true);
 			part_stat_unlock();
 		}
 		part_stat_read_all(hd, &stat);
+		stat_ioticks = (blk_queue_precise_io_stat(q) ?
+				div_u64(stat.io_ticks, NSEC_PER_MSEC) :
+				jiffies_to_msecs(stat.io_ticks));
 		seq_printf(seqf, "%4d %7d %pg "
 			   "%lu %lu %lu %u "
 			   "%lu %lu %lu %u "
@@ -1260,7 +1270,7 @@ static int diskstats_show(struct seq_file *seqf, void *v)
 			   (unsigned int)div_u64(stat.nsecs[STAT_WRITE],
							NSEC_PER_MSEC),
 			   inflight,
-			   jiffies_to_msecs(stat.io_ticks),
+			   (unsigned int)stat_ioticks,
 			   (unsigned int)div_u64(stat.nsecs[STAT_READ] +
						   stat.nsecs[STAT_WRITE] +
						   stat.nsecs[STAT_DISCARD] +
diff --git a/drivers/block/drbd/drbd_debugfs.c b/drivers/block/drbd/drbd_debugfs.c
index a72c096aa5b1..af193bcc4f3a 100644
--- a/drivers/block/drbd/drbd_debugfs.c
+++ b/drivers/block/drbd/drbd_debugfs.c
@@ -97,6 +97,12 @@ static void seq_print_one_request(struct seq_file *m, struct drbd_request *req,
 {
 	/* change anything here, fixup header below! */
 	unsigned int s = req->rq_state;
+	unsigned long start_time;
+	struct request_queue *q = req->device->rq_queue;
+
+	start_time = (blk_queue_precise_io_stat(q) ?
+		      nsecs_to_jiffies(req->start_jif) :
+		      req->start_jif);
 
 #define RQ_HDR_1 "epoch\tsector\tsize\trw"
 	seq_printf(m, "0x%x\t%llu\t%u\t%s",
@@ -105,7 +111,7 @@ static void seq_print_one_request(struct seq_file *m, struct drbd_request *req,
 		   (s & RQ_WRITE) ? "W" : "R");
 
 #define RQ_HDR_2 "\tstart\tin AL\tsubmit"
-	seq_printf(m, "\t%d", jiffies_to_msecs(now - req->start_jif));
+	seq_printf(m, "\t%d", jiffies_to_msecs(now - start_time));
 	seq_print_age_or_dash(m, s & RQ_IN_ACT_LOG, now - req->in_actlog_jif);
 	seq_print_age_or_dash(m, s & RQ_LOCAL_PENDING, now - req->pre_submit_jif);
 
@@ -171,7 +177,9 @@ static void seq_print_waiting_for_AL(struct seq_file *m, struct drbd_resource *r
 		/* if the oldest request does not wait for the activity log
		 * it is not interesting for us here */
 		if (req && !(req->rq_state & RQ_IN_ACT_LOG))
-			jif = req->start_jif;
+			jif = (blk_queue_precise_io_stat(device->rq_queue) ?
+			       nsecs_to_jiffies(req->start_jif):
+			       req->start_jif);
 		else
 			req = NULL;
 		spin_unlock_irq(&device->resource->req_lock);
diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index da28eb83e6ed..8e22487de7de 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -1663,6 +1663,7 @@ static int zram_rw_page(struct block_device *bdev, sector_t sector,
 	struct zram *zram;
 	struct bio_vec bv;
 	u64 start_time;
+	struct request_queue *q = bdev_get_queue(bdev);
 
 	if (PageTransHuge(page))
 		return -ENOTSUPP;
@@ -1682,7 +1683,7 @@ static int zram_rw_page(struct block_device *bdev, sector_t sector,
 	bv.bv_offset = 0;
 
 	start_time = bdev_start_io_acct(bdev->bd_disk->part0,
-			SECTORS_PER_PAGE, op, jiffies);
+			SECTORS_PER_PAGE, op, blk_get_iostat_ticks(q));
 	ret = zram_bvec_rw(zram, &bv, index, offset, op, NULL);
 	bdev_end_io_acct(bdev->bd_disk->part0, op, start_time);
 out:
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 011a85ea40da..1bb58a0b8cd1 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -482,7 +482,11 @@ static int dm_blk_ioctl(struct block_device *bdev, fmode_t mode,
 
 u64 dm_start_time_ns_from_clone(struct bio *bio)
 {
-	return jiffies_to_nsecs(clone_to_tio(bio)->io->start_time);
+	struct request_queue *q = bdev_get_queue(bio->bi_bdev);
+	u64 start_time_ns = (blk_queue_precise_io_stat(q) ?
+			     clone_to_tio(bio)->io->start_time :
+			     jiffies_to_nsecs(clone_to_tio(bio)->io->start_time));
+	return start_time_ns;
 }
 EXPORT_SYMBOL_GPL(dm_start_time_ns_from_clone);
 
@@ -498,6 +502,11 @@ static void dm_io_acct(struct dm_io *io, bool end)
 	struct mapped_device *md = io->md;
 	struct bio *bio = io->orig_bio;
 	unsigned int sectors;
+	struct request_queue *q = bdev_get_queue(bio->bi_bdev);
+
+	u64 start_time = (blk_queue_precise_io_stat(q) ?
+			  nsecs_to_jiffies(io->start_time) :
+			  io->start_time);
 
 	/*
	 * If REQ_PREFLUSH set, don't account payload, it will be
@@ -589,7 +598,7 @@ static struct dm_io *alloc_io(struct mapped_device *md, struct bio *bio)
 	io->orig_bio = bio;
 	io->md = md;
 	spin_lock_init(&io->lock);
-	io->start_time = jiffies;
+	io->start_time = blk_get_iostat_ticks(bio->bi_bdev->bd_queue);
 	io->flags = 0;
 
 	if (static_branch_unlikely(&stats_enabled))
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index ca94d690d292..0543a536c6e5 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -570,6 +570,7 @@ struct request_queue {
 #define QUEUE_FLAG_NOWAIT	29	/* device supports NOWAIT */
 #define QUEUE_FLAG_SQ_SCHED	30	/* single queue style io dispatch */
 #define QUEUE_FLAG_SKIP_TAGSET_QUIESCE	31 /* quiesce_tagset skip the queue*/
+#define QUEUE_FLAG_PRECISE_IO_STAT	32 /* To enable precise io accounting */
 
 #define QUEUE_FLAG_MQ_DEFAULT	((1UL << QUEUE_FLAG_IO_STAT) |		\
				 (1UL << QUEUE_FLAG_SAME_COMP) |	\
@@ -578,6 +579,7 @@ struct request_queue {
 void blk_queue_flag_set(unsigned int flag, struct request_queue *q);
 void blk_queue_flag_clear(unsigned int flag, struct request_queue *q);
 bool blk_queue_flag_test_and_set(unsigned int flag, struct request_queue *q);
+u64 blk_get_iostat_ticks(struct request_queue *q);
 
 #define blk_queue_stopped(q)	test_bit(QUEUE_FLAG_STOPPED, &(q)->queue_flags)
 #define blk_queue_dying(q)	test_bit(QUEUE_FLAG_DYING, &(q)->queue_flags)
@@ -589,6 +591,8 @@ bool blk_queue_flag_test_and_set(unsigned int flag, struct request_queue *q);
 #define blk_queue_stable_writes(q) \
	test_bit(QUEUE_FLAG_STABLE_WRITES, &(q)->queue_flags)
 #define blk_queue_io_stat(q)	test_bit(QUEUE_FLAG_IO_STAT, &(q)->queue_flags)
+#define blk_queue_precise_io_stat(q) \
+	test_bit(QUEUE_FLAG_PRECISE_IO_STAT, &(q)->queue_flags)
 #define blk_queue_add_random(q)	test_bit(QUEUE_FLAG_ADD_RANDOM, &(q)->queue_flags)
 #define blk_queue_zone_resetall(q)	\
	test_bit(QUEUE_FLAG_ZONE_RESETALL, &(q)->queue_flags)
Problem Desc
============
The "iostat" user-space utility was showing %util as 100% for disks whose
latencies are less than a millisecond, i.e. latencies in the range of
microseconds and below.

Root Cause
==========
The IO accounting in the block layer is currently done by updating io_ticks
in jiffies, which has millisecond granularity. Because of this, for devices
with IO latencies below a millisecond, the latency is accounted as
1 millisecond even though it is in the microsecond range. This caused the
iostat command to show %util as 100%, which is incorrect.

Recreating the issue
====================
Setup
-----
Devices: NVMe 24 devices
Model number: 4610 (Intel)

fio
---
[global]
bs=4K
iodepth=1
direct=1
ioengine=libaio
group_reporting
time_based
runtime=100
thinktime=1ms
numjobs=1
name=raw-write
rw=randrw
ignore_error=EIO:EIO
[job1]
filename=/dev/nvme0n1

iostat o/p
----------
Device   %wrqm  r_await  w_await  aqu-sz  rareq-sz  wareq-sz  svctm  %util
nvme3n1   0.00     0.05     0.00   75.38      0.50      0.00   0.00 100.00

Device   %wrqm  r_await  w_await  aqu-sz  rareq-sz  wareq-sz  svctm  %util
nvme3n1   0.00     0.05     0.00   74.74      0.50      0.00   0.00 100.00

Solution
========
Use ktime_get_ns() to update the disk_stats io_ticks so that io_ticks is
updated from every IO's start and end times.

Issues using ktime
==================
Using ktime_get_ns() has a performance overhead, as it reads the clock
every time it is called. The following test environment was used by
Jens Axboe to run t/io_uring, which showed a large performance drop.

Devices
-------
SSDs: P5800X
No of devices: 24

io_uring config
---------------
Polled IO
iostats: Enabled
Reads: Random
Block Size: 512
QDepth: 128
Batch Submit: 32
Batch Complete: 32
No of Threads: 24

With the above environment and the ktime patch, there was a performance
drop of ~25% compared to iostats disabled, and ~19% compared to the current
iostats enabled. This performance drop is too high.

Suggestion from Jens Axboe
==========================
Jens Axboe suggested splitting the bigger patch into two as follows:

1. In the first patch, change all the types from unsigned long to u64;
this can be done while retaining jiffies.

2. In the second patch, add an iostats == 2 setting, which enables the
higher resolution mode using ktime. We'd still default to 1, the lower
granularity with iostats enabled.

Fix details
===========
1. Use ktime_get_ns() to get the current nanoseconds for updating io_ticks
at the start and end timestamps in the block layer IO accounting

2. Create a new setting '2' in sysfs for the iostats variable, i.e.
/sys/block/<device-name>/queue/iostats, to enable io accounting with
nanosecond (ktime) granularity. This setting should be enabled only if
iostat is needed in high resolution mode, as it has a high performance cost

3. The previously available settings were 0 and 1, to disable and enable
io accounting with millisecond granularity (jiffies)

Testing
=======
Ran the t/io_uring command with the following setup:

Devices
-------
SSDs: P4610
No of devices: 8

io_uring config
---------------
Polled IO
iostats: Enabled
Reads: Random
Block Size: 512
QDepth: 128
Batch Submit: 32
Batch Complete: 32
No of Threads: 24

io_uring o/p
------------
iostat=0, with patch: Maximum IOPS=10.09M
iostat=1, with patch: Maximum IOPS=9.84M
iostat=2, with patch: Maximum IOPS=9.48M

Changes from V2 to V3
=====================
1. Changed all the required variables' data types to u64 as a first patch
2. Created a new setting '2' for iostats in sysfs in this patch
3. Changed the code to get ktime values when iostat=2, in this patch

Signed-off-by: Gulam Mohamed <gulam.mohamed@oracle.com>
---
 block/blk-core.c                  | 26 +++++++++++++++++----
 block/blk-mq.c                    |  4 ++--
 block/blk-sysfs.c                 | 39 ++++++++++++++++++++++++++++++-
 block/genhd.c                     | 18 ++++++++++----
 drivers/block/drbd/drbd_debugfs.c | 12 ++++++++--
 drivers/block/zram/zram_drv.c     |  3 ++-
 drivers/md/dm.c                   | 13 +++++++++--
 include/linux/blkdev.h            |  4 ++++
 8 files changed, 103 insertions(+), 16 deletions(-)
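(Illustrative arithmetic, assuming HZ=1000 and the numbers from the fio run
above: with iodepth=1 and thinktime=1ms the job issues roughly one ~50 us IO
per millisecond, matching the reported r_await of 0.05 ms. With
jiffies-based io_ticks, every 1 ms jiffy that sees an IO is counted as fully
busy, so io_ticks advances by about 1000 ms per second and %util reads
~100%. With ktime-based ticks the same workload accumulates roughly
1000 * 50 us = 50 ms of busy time per second, i.e. %util around 5%, which
reflects the device's real utilization.)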