
[V4,13/15] blk-throttle: add a mechanism to estimate IO latency

Message ID 9c75c9852b08404b90fbbdb143e81a8ef3b36abf.1479161136.git.shli@fb.com (mailing list archive)
State New, archived

Commit Message

Shaohua Li Nov. 14, 2016, 10:22 p.m. UTC
We want to set a latency target for each cgroup. The problem is that
latency depends heavily on request size, so users can't configure a
target for every request size. The idea is that users configure the
latency target for 4k IO and we estimate the target latency for other
request sizes.

To do this, we sample some data, e.g., the average latency for request
sizes 4k, 8k, 16k, 32k and 64k. We then fit the data with the equation
f(x) = a * x + b (x is the request size in KB, f(x) is the latency) and
use the equation to estimate the IO target latency for any request size.
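
As a sketch of how the fitted line could be applied (illustrative only;
this patch just computes and stores the slope in td->line_slope, and
deriving the intercept as b = yMean - a * xMean is an assumption here):

	/*
	 * Illustrative helper, not part of this patch: estimate the
	 * latency target for a request of size_kb kilobytes from the
	 * fitted line f(x) = slope * x + intercept.
	 */
	static u64 estimate_target_latency(s64 slope, s64 intercept, u64 size_kb)
	{
		s64 lat = slope * (s64)size_kb + intercept;

		return lat > 0 ? (u64)lat : 0;
	}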

To increase the chance of sampling, we actually collect data for any IO
size less than 64k and then calculate an average latency/size per
bucket. This is fine for the line fit because the equation should hold
for the average request size/latency too.
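
This holds because a linear relation survives averaging: if
latency_i = a * size_i + b for every sample, then

	avg(latency) = a * avg(size) + b

so each bucket's (average size, average latency) point lies on the same
line as the individual samples.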

But we shouldn't sample data at just any time. If the disk is
congested, the sampled data will not represent the disk's capability.
Hence we only sample when block throttling is at the HIGH limit, on the
assumption that the disk isn't congested in that state. If the
assumption doesn't hold, e.g., the high limit is set too high, the
calculated latency target will be higher than it should be.

How well does the equation fit actual data? I collected data from 4
different SSDs (one SATA, three NVMe). The error range is quite small.
The big differences between measured and calculated latency generally
come from 4k IO; the biggest is around 30%, which isn't terrible since
we don't need an exact latency target. We don't know whether the line
fit works for other SSDs, though. For large request sizes the error
range seems bigger, but this mechanism only decides whether we should
throttle IO (e.g., whether a cgroup is idle). If a cgroup's average
request size is big, we can simply treat it as busy, so we don't need
the mechanism there.

Hard disks are completely different: latency depends on spindle seek
time rather than request size. So this latency target feature is for
SSDs only.

The patch uses the algorithm below to calculate the equation:
https://en.wikipedia.org/wiki/Simple_linear_regression
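
For n valid points (x_i, y_i), that method gives:

	xMean = sum(x_i) / n
	yMean = sum(y_i) / n
	slope = (sum(x_i * y_i) - n * xMean * yMean)
		/ (sum(x_i^2) - n * xMean * xMean)

which is exactly the sumX/sumY/sumXY/sumX2 arithmetic in
throtl_calculate_line_slope() below, with n = valid_lat.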

TODO: the latency sampling would be better moved to the request layer

Signed-off-by: Shaohua Li <shli@fb.com>
---
 block/blk-throttle.c      | 191 +++++++++++++++++++++++++++++++++++++++++++++-
 include/linux/blk_types.h |   2 +
 2 files changed, 190 insertions(+), 3 deletions(-)

Comments

kernel test robot Nov. 14, 2016, 11:40 p.m. UTC | #1
Hi Shaohua,

[auto build test WARNING on linus/master]
[also build test WARNING on v4.9-rc5]
[cannot apply to block/for-next next-20161114]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Shaohua-Li/blk-throttle-add-high-limit/20161115-063105
config: xtensa-allmodconfig (attached as .config)
compiler: xtensa-linux-gcc (GCC) 4.9.0
reproduce:
        wget https://git.kernel.org/cgit/linux/kernel/git/wfg/lkp-tests.git/plain/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # save the attached .config to linux build tree
        make.cross ARCH=xtensa 

All warnings (new ones prefixed by >>):

   In file included from ./arch/xtensa/include/generated/asm/div64.h:1:0,
                    from include/linux/kernel.h:142,
                    from include/linux/list.h:8,
                    from include/linux/module.h:9,
                    from block/blk-throttle.c:7:
   block/blk-throttle.c: In function 'throtl_calculate_line_slope':
   include/asm-generic/div64.h:207:28: warning: comparison of distinct pointer types lacks a cast
     (void)(((typeof((n)) *)0) == ((uint64_t *)0)); \
                               ^
>> block/blk-throttle.c:2017:2: note: in expansion of macro 'do_div'
     do_div(xMean, valid_lat);
     ^
   include/asm-generic/div64.h:207:28: warning: comparison of distinct pointer types lacks a cast
     (void)(((typeof((n)) *)0) == ((uint64_t *)0)); \
                               ^
   block/blk-throttle.c:2019:2: note: in expansion of macro 'do_div'
     do_div(yMean, valid_lat);
     ^
   include/asm-generic/div64.h:207:28: warning: comparison of distinct pointer types lacks a cast
     (void)(((typeof((n)) *)0) == ((uint64_t *)0)); \
                               ^
   block/blk-throttle.c:2023:2: note: in expansion of macro 'do_div'
     do_div(slope, denominator);
     ^

vim +/do_div +2017 block/blk-throttle.c

  2001		for (i = 0; i < LATENCY_BUCKET_SIZE; i++) {
  2002			u64 x, y;
  2003	
  2004			if (td->avg_buckets[i].latency == 0)
  2005				continue;
  2006	
  2007			x = td->avg_buckets[i].size >> 10;
  2008			y = td->avg_buckets[i].latency;
  2009			sumX += x;
  2010			sumY += y;
  2011	
  2012			sumXY += x * y;
  2013			sumX2 += x * x;
  2014		}
  2015	
  2016		xMean = sumX;
> 2017		do_div(xMean, valid_lat);
  2018		yMean = sumY;
  2019		do_div(yMean, valid_lat);
  2020		denominator = sumX2 - sumX * xMean;
  2021	
  2022		slope = sumXY - sumX * yMean;
  2023		do_div(slope, denominator);
  2024		td->line_slope = slope;
  2025	}
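
The warning fires because do_div() type-checks its first argument
against u64 while xMean, yMean and slope are declared s64. One possible
fix (a sketch, not the change that was actually made) is to use the
signed 64-bit division helpers from <linux/math64.h>:

	/* div64_s64() takes a signed dividend, avoiding the do_div() check */
	xMean = div64_s64(sumX, valid_lat);
	yMean = div64_s64(sumY, valid_lat);
	denominator = sumX2 - sumX * xMean;

	slope = div64_s64(sumXY - sumX * yMean, denominator);
	td->line_slope = slope;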

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation
kernel test robot Nov. 15, 2016, 3:57 a.m. UTC | #2
Hi Shaohua,

[auto build test WARNING on linus/master]
[also build test WARNING on v4.9-rc5]
[cannot apply to block/for-next next-20161114]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Shaohua-Li/blk-throttle-add-high-limit/20161115-063105
config: openrisc-allyesconfig (attached as .config)
compiler: or32-linux-gcc (GCC) 4.5.1-or32-1.0rc1
reproduce:
        wget https://git.kernel.org/cgit/linux/kernel/git/wfg/lkp-tests.git/plain/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # save the attached .config to linux build tree
        make.cross ARCH=openrisc 

All warnings (new ones prefixed by >>):

   block/blk-throttle.c: In function 'throtl_calculate_line_slope':
>> block/blk-throttle.c:2017:2: warning: comparison of distinct pointer types lacks a cast
   block/blk-throttle.c:2019:2: warning: comparison of distinct pointer types lacks a cast
   block/blk-throttle.c:2023:2: warning: comparison of distinct pointer types lacks a cast

vim +2017 block/blk-throttle.c

  2001		for (i = 0; i < LATENCY_BUCKET_SIZE; i++) {
  2002			u64 x, y;
  2003	
  2004			if (td->avg_buckets[i].latency == 0)
  2005				continue;
  2006	
  2007			x = td->avg_buckets[i].size >> 10;
  2008			y = td->avg_buckets[i].latency;
  2009			sumX += x;
  2010			sumY += y;
  2011	
  2012			sumXY += x * y;
  2013			sumX2 += x * x;
  2014		}
  2015	
  2016		xMean = sumX;
> 2017		do_div(xMean, valid_lat);
  2018		yMean = sumY;
  2019		do_div(yMean, valid_lat);
  2020		denominator = sumX2 - sumX * xMean;
  2021	
  2022		slope = sumXY - sumX * yMean;
  2023		do_div(slope, denominator);
  2024		td->line_slope = slope;
  2025	}

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation
Tejun Heo Nov. 29, 2016, 5:24 p.m. UTC | #3
Hello, Shaohua.

On Mon, Nov 14, 2016 at 02:22:20PM -0800, Shaohua Li wrote:
> To do this, we sample some data, e.g., the average latency for request
> sizes 4k, 8k, 16k, 32k and 64k. We then fit the data with the equation
> f(x) = a * x + b (x is the request size in KB, f(x) is the latency) and
> use the equation to estimate the IO target latency for any request size.

As discussed separately, it might make more sense to just use the avg
of the closest bucket instead of trying to line-fit the buckets, but
it's an implementation detail and whichever works is fine.
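
Something like the sketch below is what I mean (illustrative only,
reusing the avg_buckets[] and LATENCY_BUCKET_SIZE from the patch; the
helper name is made up):

	static u64 closest_bucket_latency(struct throtl_data *td, u64 size)
	{
		u64 best_delta = U64_MAX;
		u64 latency = 0;
		int i;

		for (i = 0; i < LATENCY_BUCKET_SIZE; i++) {
			u64 bsize = td->avg_buckets[i].size;
			u64 delta;

			if (!td->avg_buckets[i].latency)
				continue;
			/* pick the bucket whose avg size is closest */
			delta = bsize > size ? bsize - size : size - bsize;
			if (delta < best_delta) {
				best_delta = delta;
				latency = td->avg_buckets[i].latency;
			}
		}
		return latency;
	}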

> Hard disks are completely different: latency depends on spindle seek
> time rather than request size. So this latency target feature is for
> SSDs only.

I'm not sure about this.  While a disk's latency profile is way higher
and more erratic than an SSD's, that doesn't make a latency target
useless.  Sure, it'll be more crude, but there's a significant
difference between a cgroup having <= 20ms overall latency and one
experiencing multi-second latency.

Thanks.
Shaohua Li Nov. 29, 2016, 6:30 p.m. UTC | #4
On Tue, Nov 29, 2016 at 12:24:35PM -0500, Tejun Heo wrote:
> Hello, Shaohua.
> 
> On Mon, Nov 14, 2016 at 02:22:20PM -0800, Shaohua Li wrote:
> > To do this, we sample some data, e.g., the average latency for request
> > sizes 4k, 8k, 16k, 32k and 64k. We then fit the data with the equation
> > f(x) = a * x + b (x is the request size in KB, f(x) is the latency) and
> > use the equation to estimate the IO target latency for any request size.
> 
> As discussed separately, it might make more sense to just use the avg
> of the closest bucket instead of trying to line-fit the buckets, but
> it's an implementation detail and whichever works is fine.

That is still essentially a line fit; I don't think there is a big difference.

> > Hard disks are completely different: latency depends on spindle seek
> > time rather than request size. So this latency target feature is for
> > SSDs only.
> 
> I'm not sure about this.  While a disk's latency profile is way higher
> and more erratic than an SSD's, that doesn't make a latency target
> useless.  Sure, it'll be more crude, but there's a significant
> difference between a cgroup having <= 20ms overall latency and one
> experiencing multi-second latency.

Sure, a latency target is useful for hard disks too, but we need a
different strategy. For a hard disk, the latency depends heavily on
seek time, so we can probably make the latency target the same for
all request sizes. I'm not sure if average latency makes sense; that
needs more testing with hard disks. I'd like to focus on SSDs at the
current stage.

Thanks,
Shaohua
Tejun Heo Nov. 29, 2016, 10:36 p.m. UTC | #5
Hello,

On Tue, Nov 29, 2016 at 10:30:44AM -0800, Shaohua Li wrote:
> > As discussed separately, it might make more sense to just use the avg
> > of the closest bucket instead of trying to line-fit the buckets, but
> > it's an implementation detail and whichever works is fine.
> 
> That is still essentially a line fit; I don't think there is a big difference.

Yeah, just wondering whether that'd be simpler.

> > > Hard disks are completely different: latency depends on spindle seek
> > > time rather than request size. So this latency target feature is for
> > > SSDs only.
> > 
> > I'm not sure about this.  While a disk's latency profile is way higher
> > and more erratic than an SSD's, that doesn't make a latency target
> > useless.  Sure, it'll be more crude, but there's a significant
> > difference between a cgroup having <= 20ms overall latency and one
> > experiencing multi-second latency.
> 
> Sure, a latency target is useful for hard disks too, but we need a
> different strategy. For a hard disk, the latency depends heavily on
> seek time, so we can probably make the latency target the same for
> all request sizes. I'm not sure if average latency makes sense; that
> needs more testing with hard disks. I'd like to focus on SSDs at the
> current stage.

Sure, it's fine to focus on SSDs for now, but either line fitting or
a bucketed average should work there too, right?  The slope of the
line would be way lower and the deviation higher, but that doesn't
really get in the way here.

Thanks.

Patch

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index 01b494d..a05d351 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -156,6 +156,20 @@  struct throtl_grp {
 	u64 avg_ttime;
 };
 
+/* We measure latency for request size from 4k to 4k * ( 1 << 4) */
+#define LATENCY_BUCKET_SIZE 5
+
+struct latency_bucket {
+	u64 total_latency;
+	u64 total_size;
+	int samples;
+};
+
+struct avg_latency_bucket {
+	u64 latency;
+	u64 size;
+};
+
 struct throtl_data
 {
 	/* service tree for active throtl groups */
@@ -179,6 +193,12 @@  struct throtl_data
 	unsigned int scale;
 
 	u64 idle_ttime_threshold;
+
+	struct latency_bucket tmp_buckets[LATENCY_BUCKET_SIZE];
+	struct avg_latency_bucket avg_buckets[LATENCY_BUCKET_SIZE];
+	struct latency_bucket __percpu *latency_buckets;
+	s64 line_slope;
+	unsigned long last_calculate_time;
 };
 
 static void throtl_pending_timer_fn(unsigned long arg);
@@ -288,6 +308,19 @@  static unsigned int tg_iops_limit(struct throtl_grp *tg, int rw)
 	return ret;
 }
 
+static int request_bucket_index(sector_t sectors)
+{
+	int i;
+
+	for (i = LATENCY_BUCKET_SIZE - 1; i >= 0; i--) {
+		if (sectors > (1 << (i + 3)))
+			break;
+	}
+	if (i == LATENCY_BUCKET_SIZE - 1)
+		return -1;
+	return i + 1;
+}
+
 /**
  * throtl_log - log debug message via blktrace
  * @sq: the service_queue being reported
@@ -1877,6 +1910,120 @@  static void blk_throtl_update_ttime(struct throtl_grp *tg)
 	tg->checked_last_finish_time = last_finish_time;
 }
 
+static void throtl_calculate_line_slope(struct throtl_data *td)
+{
+	struct avg_latency_bucket avg_latency[LATENCY_BUCKET_SIZE];
+	s64 sumX;
+	s64 sumY;
+	s64 sumXY;
+	s64 sumX2;
+	s64 xMean;
+	s64 yMean;
+	s64 denominator;
+	s64 slope;
+	int i, cpu;
+	int valid_lat;
+	u64 last_latency = 0;
+
+	if (!blk_queue_nonrot(td->queue))
+		return;
+	if (time_before(jiffies, td->last_calculate_time + HZ))
+		return;
+	td->last_calculate_time = jiffies;
+
+	memset(avg_latency, 0, sizeof(avg_latency));
+	for (i = 0; i < LATENCY_BUCKET_SIZE; i++) {
+		struct latency_bucket *tmp = &td->tmp_buckets[i];
+
+		for_each_possible_cpu(cpu) {
+			struct latency_bucket *bucket;
+
+			bucket = per_cpu_ptr(td->latency_buckets, cpu);
+			tmp->total_latency += bucket[i].total_latency;
+			tmp->total_size += bucket[i].total_size;
+			tmp->samples += bucket[i].samples;
+			bucket[i].total_latency = 0;
+			bucket[i].total_size = 0;
+			bucket[i].samples = 0;
+		}
+
+		if (tmp->samples >= 32) {
+			u64 latency = tmp->total_latency;
+			u64 size = tmp->total_size;
+			int samples = tmp->samples;
+
+			tmp->total_latency = 0;
+			tmp->total_size = 0;
+			tmp->samples = 0;
+			do_div(size, samples);
+			if (size == 0 || size > (1 << (i + 12)))
+				continue;
+			avg_latency[i].size = size;
+			do_div(latency, samples);
+			if (latency == 0)
+				continue;
+			avg_latency[i].latency = latency;
+		}
+	}
+
+	valid_lat = 0;
+	for (i = 0; i < LATENCY_BUCKET_SIZE; i++) {
+		if (!td->avg_buckets[i].latency && !avg_latency[i].latency)
+			continue;
+		valid_lat++;
+		if (!td->avg_buckets[i].latency) {
+			td->avg_buckets[i].latency = avg_latency[i].latency;
+			td->avg_buckets[i].size = avg_latency[i].size;
+			continue;
+		}
+		if (!avg_latency[i].latency)
+			continue;
+		/* make it smooth */
+		td->avg_buckets[i].latency = (td->avg_buckets[i].latency * 7 +
+			avg_latency[i].latency) >> 3;
+		td->avg_buckets[i].size = (td->avg_buckets[i].size * 7 +
+			avg_latency[i].size) >> 3;
+		/* filter out abnormal latency */
+		if (td->avg_buckets[i].latency <= last_latency) {
+			td->avg_buckets[i].latency = 0;
+			valid_lat--;
+		} else
+			last_latency = td->avg_buckets[i].latency;
+	}
+
+	if (valid_lat < 2)
+		return;
+
+	sumX = 0;
+	sumY = 0;
+	sumXY = 0;
+	sumX2 = 0;
+	for (i = 0; i < LATENCY_BUCKET_SIZE; i++) {
+		u64 x, y;
+
+		if (td->avg_buckets[i].latency == 0)
+			continue;
+
+		x = td->avg_buckets[i].size >> 10;
+		y = td->avg_buckets[i].latency;
+		sumX += x;
+		sumY += y;
+
+		sumXY += x * y;
+		sumX2 += x * x;
+	}
+
+	xMean = sumX;
+	do_div(xMean, valid_lat);
+	yMean = sumY;
+	do_div(yMean, valid_lat);
+	denominator = sumX2 - sumX * xMean;
+
+	slope = sumXY - sumX * yMean;
+	do_div(slope, denominator);
+	td->line_slope = slope;
+}
+
 bool blk_throtl_bio(struct request_queue *q, struct blkcg_gq *blkg,
 		    struct bio *bio)
 {
@@ -1901,11 +2048,14 @@  bool blk_throtl_bio(struct request_queue *q, struct blkcg_gq *blkg,
 
 	spin_lock_irq(q->queue_lock);
 
+	throtl_calculate_line_slope(tg->td);
+
 	if (unlikely(blk_queue_bypass(q)))
 		goto out_unlock;
 
 	bio_associate_current(bio);
 	bio->bi_cg_private = q;
+	bio->bi_cg_size = bio_sectors(bio);
 
 	blk_throtl_update_ttime(tg);
 
@@ -1992,8 +2142,11 @@  bool blk_throtl_bio(struct request_queue *q, struct blkcg_gq *blkg,
 	 * don't want bios to leave with the flag set.  Clear the flag if
 	 * being issued.
 	 */
-	if (!throttled)
+	if (!throttled) {
+		if (blk_queue_nonrot(q))
+			bio->bi_cg_issue_time = ktime_get_ns();
 		bio->bi_opf &= ~REQ_THROTTLED;
+	}
 	return throttled;
 }
 
@@ -2003,6 +2156,9 @@  void blk_throtl_bio_endio(struct bio *bio)
 	struct blkcg_gq *blkg;
 	struct throtl_grp *tg;
 	struct request_queue *q;
+	struct throtl_data *td;
+	u64 finish_time;
+	u64 lat;
 
 	q = bio->bi_cg_private;
 	if (!q)
@@ -2019,7 +2175,27 @@  void blk_throtl_bio_endio(struct bio *bio)
 
 	tg = blkg_to_tg(blkg ?: q->root_blkg);
 
-	tg->last_finish_time = ktime_get_ns();
+	finish_time = ktime_get_ns();
+	tg->last_finish_time = finish_time;
+
+	td = tg->td;
+
+	if (bio->bi_cg_issue_time && finish_time > bio->bi_cg_issue_time) {
+		int index;
+
+		lat = finish_time - bio->bi_cg_issue_time;
+		index = request_bucket_index(bio->bi_cg_size);
+		if (index >= 0 && bio_op(bio) == REQ_OP_READ &&
+				td->limit_index == LIMIT_HIGH) {
+			struct latency_bucket *latency;
+
+			latency = get_cpu_ptr(td->latency_buckets);
+			latency[index].total_latency += lat;
+			latency[index].total_size += bio->bi_cg_size << 9;
+			latency[index].samples++;
+			put_cpu_ptr(td->latency_buckets);
+		}
+	}
 
 end:
 	rcu_read_unlock();
@@ -2097,6 +2273,12 @@  int blk_throtl_init(struct request_queue *q)
 	td = kzalloc_node(sizeof(*td), GFP_KERNEL, q->node);
 	if (!td)
 		return -ENOMEM;
+	td->latency_buckets = __alloc_percpu(sizeof(struct latency_bucket) *
+		LATENCY_BUCKET_SIZE, __alignof__(u64));
+	if (!td->latency_buckets) {
+		kfree(td);
+		return -ENOMEM;
+	}
 
 	INIT_WORK(&td->dispatch_work, blk_throtl_dispatch_work_fn);
 	throtl_service_queue_init(&td->service_queue);
@@ -2113,8 +2295,10 @@  int blk_throtl_init(struct request_queue *q)
 	td->idle_ttime_threshold = -1;
 	/* activate policy */
 	ret = blkcg_activate_policy(q, &blkcg_policy_throtl);
-	if (ret)
+	if (ret) {
+		free_percpu(td->latency_buckets);
 		kfree(td);
+	}
 	return ret;
 }
 
@@ -2123,6 +2307,7 @@  void blk_throtl_exit(struct request_queue *q)
 	BUG_ON(!q->td);
 	throtl_shutdown_wq(q);
 	blkcg_deactivate_policy(q, &blkcg_policy_throtl);
+	free_percpu(q->td->latency_buckets);
 	kfree(q->td);
 }
 
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index ff8dd24..45bb437 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -60,6 +60,8 @@  struct bio {
 	struct io_context	*bi_ioc;
 	struct cgroup_subsys_state *bi_css;
 	void *bi_cg_private;
+	u64 bi_cg_issue_time;
+	sector_t bi_cg_size;
 #endif
 	union {
 #if defined(CONFIG_BLK_DEV_INTEGRITY)