Message ID | 9c75c9852b08404b90fbbdb143e81a8ef3b36abf.1479161136.git.shli@fb.com (mailing list archive)
---|---
State | New, archived
Hi Shaohua,

[auto build test WARNING on linus/master]
[also build test WARNING on v4.9-rc5]
[cannot apply to block/for-next next-20161114]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Shaohua-Li/blk-throttle-add-high-limit/20161115-063105
config: xtensa-allmodconfig (attached as .config)
compiler: xtensa-linux-gcc (GCC) 4.9.0
reproduce:
        wget https://git.kernel.org/cgit/linux/kernel/git/wfg/lkp-tests.git/plain/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # save the attached .config to linux build tree
        make.cross ARCH=xtensa

All warnings (new ones prefixed by >>):

   In file included from ./arch/xtensa/include/generated/asm/div64.h:1:0,
                    from include/linux/kernel.h:142,
                    from include/linux/list.h:8,
                    from include/linux/module.h:9,
                    from block/blk-throttle.c:7:
   block/blk-throttle.c: In function 'throtl_calculate_line_slope':
   include/asm-generic/div64.h:207:28: warning: comparison of distinct pointer types lacks a cast
     (void)(((typeof((n)) *)0) == ((uint64_t *)0)); \
                               ^
>> block/blk-throttle.c:2017:2: note: in expansion of macro 'do_div'
     do_div(xMean, valid_lat);
     ^
   include/asm-generic/div64.h:207:28: warning: comparison of distinct pointer types lacks a cast
     (void)(((typeof((n)) *)0) == ((uint64_t *)0)); \
                               ^
   block/blk-throttle.c:2019:2: note: in expansion of macro 'do_div'
     do_div(yMean, valid_lat);
     ^
   include/asm-generic/div64.h:207:28: warning: comparison of distinct pointer types lacks a cast
     (void)(((typeof((n)) *)0) == ((uint64_t *)0)); \
                               ^
   block/blk-throttle.c:2023:2: note: in expansion of macro 'do_div'
     do_div(slope, denominator);
     ^

vim +/do_div +2017 block/blk-throttle.c

  2001		for (i = 0; i < LATENCY_BUCKET_SIZE; i++) {
  2002			u64 x, y;
  2003
  2004			if (td->avg_buckets[i].latency == 0)
  2005				continue;
  2006
  2007			x = td->avg_buckets[i].size >> 10;
  2008			y = td->avg_buckets[i].latency;
  2009			sumX += x;
  2010			sumY += y;
  2011
  2012			sumXY += x * y;
  2013			sumX2 += x * x;
  2014		}
  2015
  2016		xMean = sumX;
> 2017		do_div(xMean, valid_lat);
  2018		yMean = sumY;
  2019		do_div(yMean, valid_lat);
  2020		denominator = sumX2 - sumX * xMean;
  2021
  2022		slope = sumXY - sumX * yMean;
  2023		do_div(slope, denominator);
  2024		td->line_slope = slope;
  2025	}

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation
Hi Shaohua,

[auto build test WARNING on linus/master]
[also build test WARNING on v4.9-rc5]
[cannot apply to block/for-next next-20161114]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Shaohua-Li/blk-throttle-add-high-limit/20161115-063105
config: openrisc-allyesconfig (attached as .config)
compiler: or32-linux-gcc (GCC) 4.5.1-or32-1.0rc1
reproduce:
        wget https://git.kernel.org/cgit/linux/kernel/git/wfg/lkp-tests.git/plain/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # save the attached .config to linux build tree
        make.cross ARCH=openrisc

All warnings (new ones prefixed by >>):

   block/blk-throttle.c: In function 'throtl_calculate_line_slope':
>> block/blk-throttle.c:2017:2: warning: comparison of distinct pointer types lacks a cast
   block/blk-throttle.c:2019:2: warning: comparison of distinct pointer types lacks a cast
   block/blk-throttle.c:2023:2: warning: comparison of distinct pointer types lacks a cast

vim +2017 block/blk-throttle.c

  2001		for (i = 0; i < LATENCY_BUCKET_SIZE; i++) {
  2002			u64 x, y;
  2003
  2004			if (td->avg_buckets[i].latency == 0)
  2005				continue;
  2006
  2007			x = td->avg_buckets[i].size >> 10;
  2008			y = td->avg_buckets[i].latency;
  2009			sumX += x;
  2010			sumY += y;
  2011
  2012			sumXY += x * y;
  2013			sumX2 += x * x;
  2014		}
  2015
  2016		xMean = sumX;
> 2017		do_div(xMean, valid_lat);
  2018		yMean = sumY;
  2019		do_div(yMean, valid_lat);
  2020		denominator = sumX2 - sumX * xMean;
  2021
  2022		slope = sumXY - sumX * yMean;
  2023		do_div(slope, denominator);
  2024		td->line_slope = slope;
  2025	}
Hello, Shaohua.

On Mon, Nov 14, 2016 at 02:22:20PM -0800, Shaohua Li wrote:
> To do this, we sample some data, eg, average latency for request size
> 4k, 8k, 16k, 32k, 64k. We then use an equation f(x) = a * x + b to fit
> the data (x is request size in KB, f(x) is the latency). Then we can use
> the equation to estimate IO target latency for any request.

As discussed separately, it might make more sense to just use the avg
of the closest bucket instead of trying to line-fit the buckets, but
it's an implementation detail and whatever works is fine.

> Hard disk is completely different. Latency depends on spindle seek
> instead of request size. So this latency target feature is for SSD only.

I'm not sure about this. While a disk's latency profile is way higher
and more erratic than an SSD's, that doesn't make a latency target
useless. Sure, it'll be cruder, but there's a significant difference
between a cgroup having <= 20ms overall latency and experiencing
multi-sec latency.

Thanks.
On Tue, Nov 29, 2016 at 12:24:35PM -0500, Tejun Heo wrote:
> Hello, Shaohua.
>
> On Mon, Nov 14, 2016 at 02:22:20PM -0800, Shaohua Li wrote:
> > To do this, we sample some data, eg, average latency for request size
> > 4k, 8k, 16k, 32k, 64k. We then use an equation f(x) = a * x + b to fit
> > the data (x is request size in KB, f(x) is the latency). Then we can use
> > the equation to estimate IO target latency for any request.
>
> As discussed separately, it might make more sense to just use the avg
> of the closest bucket instead of trying to line-fit the buckets, but
> it's an implementation detail and whatever works is fine.

That is still like a line fit; I don't think there is a big difference.

> > Hard disk is completely different. Latency depends on spindle seek
> > instead of request size. So this latency target feature is for SSD only.
>
> I'm not sure about this. While a disk's latency profile is way higher
> and more erratic than an SSD's, that doesn't make a latency target
> useless. Sure, it'll be cruder, but there's a significant difference
> between a cgroup having <= 20ms overall latency and experiencing
> multi-sec latency.

Sure, a latency target is useful for hard disks too, but we need a
different strategy. For a hard disk, latency depends heavily on seeks, so
we could probably make the latency target the same for all request sizes.
I'm not sure average latency makes sense there; it needs more testing with
hard disks. I'd like to focus on SSDs at the current stage.

Thanks,
Shaohua
Hello,

On Tue, Nov 29, 2016 at 10:30:44AM -0800, Shaohua Li wrote:
> > As discussed separately, it might make more sense to just use the avg
> > of the closest bucket instead of trying to line-fit the buckets, but
> > it's an implementation detail and whatever works is fine.
>
> That is still like a line fit; I don't think there is a big difference.

Yeah, just wondering whether that'd be simpler.

> > > Hard disk is completely different. Latency depends on spindle seek
> > > instead of request size. So this latency target feature is for SSD only.
> >
> > I'm not sure about this. While a disk's latency profile is way higher
> > and more erratic than an SSD's, that doesn't make a latency target
> > useless. Sure, it'll be cruder, but there's a significant difference
> > between a cgroup having <= 20ms overall latency and experiencing
> > multi-sec latency.
>
> Sure, a latency target is useful for hard disks too, but we need a
> different strategy. For a hard disk, latency depends heavily on seeks, so
> we could probably make the latency target the same for all request sizes.
> I'm not sure average latency makes sense there; it needs more testing with
> hard disks. I'd like to focus on SSDs at the current stage.

Sure, it's fine to focus on SSDs for now, but either line fitting or
bucketed avg should work, right? The slope of the line would be way lower
and the deviation would be higher, but that doesn't really get in the way
here.

Thanks.
diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index 01b494d..a05d351 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -156,6 +156,20 @@ struct throtl_grp {
 	u64 avg_ttime;
 };
 
+/* We measure latency for request size from 4k to 4k * ( 1 << 4) */
+#define LATENCY_BUCKET_SIZE 5
+
+struct latency_bucket {
+	u64 total_latency;
+	u64 total_size;
+	int samples;
+};
+
+struct avg_latency_bucket {
+	u64 latency;
+	u64 size;
+};
+
 struct throtl_data
 {
 	/* service tree for active throtl groups */
@@ -179,6 +193,12 @@ struct throtl_data
 	unsigned int scale;
 
 	u64 idle_ttime_threshold;
+
+	struct latency_bucket tmp_buckets[LATENCY_BUCKET_SIZE];
+	struct avg_latency_bucket avg_buckets[LATENCY_BUCKET_SIZE];
+	struct latency_bucket __percpu *latency_buckets;
+	s64 line_slope;
+	unsigned long last_calculate_time;
 };
 
 static void throtl_pending_timer_fn(unsigned long arg);
@@ -288,6 +308,19 @@ static unsigned int tg_iops_limit(struct throtl_grp *tg, int rw)
 	return ret;
 }
 
+static int request_bucket_index(sector_t sectors)
+{
+	int i;
+
+	for (i = LATENCY_BUCKET_SIZE - 1; i >= 0; i--) {
+		if (sectors > (1 << (i + 3)))
+			break;
+	}
+	if (i == LATENCY_BUCKET_SIZE - 1)
+		return -1;
+	return i + 1;
+}
+
 /**
  * throtl_log - log debug message via blktrace
  * @sq: the service_queue being reported
@@ -1877,6 +1910,120 @@ static void blk_throtl_update_ttime(struct throtl_grp *tg)
 	tg->checked_last_finish_time = last_finish_time;
 }
 
+static void throtl_calculate_line_slope(struct throtl_data *td)
+{
+	struct avg_latency_bucket avg_latency[LATENCY_BUCKET_SIZE];
+	s64 sumX;
+	s64 sumY;
+	s64 sumXY;
+	s64 sumX2;
+	s64 xMean;
+	s64 yMean;
+	s64 denominator;
+	s64 slope;
+	int i, cpu;
+	int valid_lat;
+	u64 last_latency = 0;
+
+	if (!blk_queue_nonrot(td->queue))
+		return;
+	if (time_before(jiffies, td->last_calculate_time + HZ))
+		return;
+	td->last_calculate_time = jiffies;
+
+	memset(avg_latency, 0, sizeof(avg_latency));
+	for (i = 0; i < LATENCY_BUCKET_SIZE; i++) {
+		struct latency_bucket *tmp = &td->tmp_buckets[i];
+
+		for_each_possible_cpu(cpu) {
+			struct latency_bucket *bucket;
+
+			bucket = per_cpu_ptr(td->latency_buckets, cpu);
+			tmp->total_latency += bucket[i].total_latency;
+			tmp->total_size += bucket[i].total_size;
+			tmp->samples += bucket[i].samples;
+			bucket[i].total_latency = 0;
+			bucket[i].total_size = 0;
+			bucket[i].samples = 0;
+		}
+
+		if (tmp->samples >= 32) {
+			u64 latency = tmp->total_latency;
+			u64 size = tmp->total_size;
+			int samples = tmp->samples;
+
+			tmp->total_latency = 0;
+			tmp->total_size = 0;
+			tmp->samples = 0;
+			do_div(size, samples);
+			if (size == 0 || size > (1 << (i + 12)))
+				continue;
+			avg_latency[i].size = size;
+			do_div(latency, samples);
+			if (latency == 0)
+				continue;
+			avg_latency[i].latency = latency;
+		}
+	}
+
+	valid_lat = 0;
+	for (i = 0; i < LATENCY_BUCKET_SIZE; i++) {
+		if (!td->avg_buckets[i].latency && !avg_latency[i].latency)
+			continue;
+		valid_lat++;
+		if (!td->avg_buckets[i].latency) {
+			td->avg_buckets[i].latency = avg_latency[i].latency;
+			td->avg_buckets[i].size = avg_latency[i].size;
+			continue;
+		}
+		if (!avg_latency[i].latency)
+			continue;
+		/* make it smooth */
+		td->avg_buckets[i].latency = (td->avg_buckets[i].latency * 7 +
+			avg_latency[i].latency) >> 3;
+		td->avg_buckets[i].size = (td->avg_buckets[i].size * 7 +
+			avg_latency[i].size) >> 3;
+		/* filter out abnormal latency */
+		if (td->avg_buckets[i].latency <= last_latency) {
+			td->avg_buckets[i].latency = 0;
+			valid_lat--;
+		} else
+			last_latency = td->avg_buckets[i].latency;
+	}
+
+	if (valid_lat < 2)
+		return;
+
+	sumX = 0;
+	sumY = 0;
+	sumXY = 0;
+	sumX2 = 0;
+	for (i = 0; i < LATENCY_BUCKET_SIZE; i++) {
+		u64 x, y;
+
+		if (td->avg_buckets[i].latency == 0)
+			continue;
+
+		x = td->avg_buckets[i].size >> 10;
+		y = td->avg_buckets[i].latency;
+		sumX += x;
+		sumY += y;
+
+		sumXY += x * y;
+		sumX2 += x * x;
+	}
+
+	xMean = sumX;
+	do_div(xMean, valid_lat);
+	yMean = sumY;
+	do_div(yMean, valid_lat);
+	denominator = sumX2 - sumX * xMean;
+
+	slope = sumXY - sumX * yMean;
+	do_div(slope, denominator);
+	td->line_slope = slope;
+}
+
 bool blk_throtl_bio(struct request_queue *q, struct blkcg_gq *blkg,
 		    struct bio *bio)
 {
@@ -1901,11 +2048,14 @@ bool blk_throtl_bio(struct request_queue *q, struct blkcg_gq *blkg,
 
 	spin_lock_irq(q->queue_lock);
 
+	throtl_calculate_line_slope(tg->td);
+
 	if (unlikely(blk_queue_bypass(q)))
 		goto out_unlock;
 
 	bio_associate_current(bio);
 	bio->bi_cg_private = q;
+	bio->bi_cg_size = bio_sectors(bio);
 
 	blk_throtl_update_ttime(tg);
 
@@ -1992,8 +2142,11 @@ bool blk_throtl_bio(struct request_queue *q, struct blkcg_gq *blkg,
 	 * don't want bios to leave with the flag set.  Clear the flag if
 	 * being issued.
 	 */
-	if (!throttled)
+	if (!throttled) {
+		if (blk_queue_nonrot(q))
+			bio->bi_cg_issue_time = ktime_get_ns();
 		bio->bi_opf &= ~REQ_THROTTLED;
+	}
 	return throttled;
 }
 
@@ -2003,6 +2156,9 @@ void blk_throtl_bio_endio(struct bio *bio)
 	struct blkcg_gq *blkg;
 	struct throtl_grp *tg;
 	struct request_queue *q;
+	struct throtl_data *td;
+	u64 finish_time;
+	u64 lat;
 
 	q = bio->bi_cg_private;
 	if (!q)
@@ -2019,7 +2175,27 @@ void blk_throtl_bio_endio(struct bio *bio)
 
 	tg = blkg_to_tg(blkg ?: q->root_blkg);
 
-	tg->last_finish_time = ktime_get_ns();
+	finish_time = ktime_get_ns();
+	tg->last_finish_time = finish_time;
+
+	td = tg->td;
+
+	if (bio->bi_cg_issue_time && finish_time > bio->bi_cg_issue_time) {
+		int index;
+
+		lat = finish_time - bio->bi_cg_issue_time;
+		index = request_bucket_index(bio->bi_cg_size);
+		if (index >= 0 && bio_op(bio) == REQ_OP_READ &&
+		    td->limit_index == LIMIT_HIGH) {
+			struct latency_bucket *latency;
+
+			latency = get_cpu_ptr(td->latency_buckets);
+			latency[index].total_latency += lat;
+			latency[index].total_size += bio->bi_cg_size << 9;
+			latency[index].samples++;
+			put_cpu_ptr(td->latency_buckets);
+		}
+	}
 
 end:
 	rcu_read_unlock();
@@ -2097,6 +2273,12 @@ int blk_throtl_init(struct request_queue *q)
 	td = kzalloc_node(sizeof(*td), GFP_KERNEL, q->node);
 	if (!td)
 		return -ENOMEM;
+	td->latency_buckets = __alloc_percpu(sizeof(struct latency_bucket) *
+		LATENCY_BUCKET_SIZE, __alignof__(u64));
+	if (!td->latency_buckets) {
+		kfree(td);
+		return -ENOMEM;
+	}
 
 	INIT_WORK(&td->dispatch_work, blk_throtl_dispatch_work_fn);
 	throtl_service_queue_init(&td->service_queue);
@@ -2113,8 +2295,10 @@ int blk_throtl_init(struct request_queue *q)
 	td->idle_ttime_threshold = -1;
 	/* activate policy */
 	ret = blkcg_activate_policy(q, &blkcg_policy_throtl);
-	if (ret)
+	if (ret) {
+		free_percpu(td->latency_buckets);
 		kfree(td);
+	}
 	return ret;
 }
 
@@ -2123,6 +2307,7 @@ void blk_throtl_exit(struct request_queue *q)
 	BUG_ON(!q->td);
 	throtl_shutdown_wq(q);
 	blkcg_deactivate_policy(q, &blkcg_policy_throtl);
+	free_percpu(q->td->latency_buckets);
 	kfree(q->td);
 }
 
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index ff8dd24..45bb437 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -60,6 +60,8 @@ struct bio {
 	struct io_context	*bi_ioc;
 	struct cgroup_subsys_state *bi_css;
 	void			*bi_cg_private;
+	u64			bi_cg_issue_time;
+	sector_t		bi_cg_size;
 #endif
 	union {
 #if defined(CONFIG_BLK_DEV_INTEGRITY)
We try to set a latency target for each cgroup. The problem is that latency
highly depends on request size, and users can't configure a target for
every request size. The idea is that users configure a latency target for
4k IO, and we estimate the target latency for IO of other request sizes.

To do this, we sample some data, eg, average latency for request sizes 4k,
8k, 16k, 32k and 64k. We then use an equation f(x) = a * x + b to fit the
data (x is request size in KB, f(x) is the latency). Then we can use the
equation to estimate the IO target latency for any request.

To increase the chance of sampling, we actually collect data for any IO
size less than 64k, then calculate an average latency/size. This is ok for
the line fit because the equation should work for average request
size/latency too.

But we shouldn't sample data at just any time. If the disk is congested,
the collected data will not represent the disk's capability. Hence we only
sample while block throttling is at the HIGH limit, on the assumption that
the disk isn't congested in that state. If the assumption isn't true, eg,
the high limit is too high, the calculated latency target will be higher
than it should be.

How well does the equation fit actual data? I collected data from 4
different SSDs (one SATA, 3 NVMe). The error range is quite small. The big
difference between measured and calculated latency generally comes from 4k
IO. The biggest case has around 30% difference, which isn't terrible as we
don't need an accurate latency target. We don't know if the line fit works
for other SSDs though.

For big request sizes the latency error range seems larger. But this
mechanism is to determine if we should throttle IO (eg, if a cgroup is
idle). If a cgroup's average request size is big, we can simply treat it
as busy, hence we don't need the mechanism there.

Hard disk is completely different. Latency depends on spindle seek instead
of request size, so this latency target feature is for SSD only.
The patch uses the algorithm below to calculate the equation:
https://en.wikipedia.org/wiki/Simple_linear_regression

TODO: the latency sampling would be better moved to the request layer

Signed-off-by: Shaohua Li <shli@fb.com>
---
 block/blk-throttle.c      | 191 +++++++++++++++++++++++++++++++++++++++++++++-
 include/linux/blk_types.h |   2 +
 2 files changed, 190 insertions(+), 3 deletions(-)