
[V4,15/15] blk-throttle: add latency target support

Message ID 420a0f26dd7a20ad8316258c81cb64043134bc86.1479161136.git.shli@fb.com (mailing list archive)
State New, archived

Commit Message

Shaohua Li Nov. 14, 2016, 10:22 p.m. UTC
One hard problem in adding the .high limit is detecting idle cgroups. If one
cgroup doesn't dispatch enough IO against its high limit, we must have a
mechanism to determine whether other cgroups can dispatch more IO. We added
the think time detection mechanism before, but it doesn't work for all
workloads. Here we add a latency based approach.

We calculate the average request size and average latency of a cgroup.
With the average request size and the equation we can then calculate the
target latency for the cgroup. In queue LIMIT_HIGH state, if a cgroup
doesn't dispatch enough IO against its high limit but its average latency is
lower than its target latency, we treat the cgroup as idle. In this case
other cgroups can dispatch more IO, e.g., beyond their high limit.
Similarly, in queue LIMIT_MAX state, if a cgroup doesn't dispatch enough
IO but its average latency is higher than its target latency, we treat
the cgroup as busy. In this case, we should throttle other cgroups to bring
the first cgroup's latency down.

If a cgroup's average request size is big (the cutoff is currently 128k), we
always treat the cgroup as busy (the think time check is still effective,
though).
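
As an illustration only, here is a minimal userspace sketch of the
classification above (the helper and parameter names are made up; the target
equation mirrors the line-based throtl_target_latency() in the patch below,
anchored at 4k):

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

static uint64_t target_latency_ns(uint64_t latency_target_ns,
				  uint64_t line_slope_ns_per_kb,
				  uint64_t avg_size_bytes)
{
	/* latency_target + f(avg_size) - f(4k): a line in request size */
	return latency_target_ns +
	       line_slope_ns_per_kb * ((avg_size_bytes >> 10) - 4);
}

static bool cgroup_is_idle(uint64_t avg_latency_ns, uint64_t avg_size_bytes,
			   uint64_t latency_target_ns,
			   uint64_t line_slope_ns_per_kb)
{
	/* a big average request size (> 128k) is always treated as busy */
	if (avg_size_bytes > 128 * 1024)
		return false;
	/* otherwise idle while the measured latency stays under the target */
	return avg_latency_ns < target_latency_ns(latency_target_ns,
						  line_slope_ns_per_kb,
						  avg_size_bytes);
}

int main(void)
{
	/* 16k average requests at 300us, with a 200us target at 4k and a
	 * slope of 10us per extra KB: the target is 320us, so "idle" */
	printf("%s\n", cgroup_is_idle(300000, 16 * 1024, 200000, 10000) ?
	       "idle" : "busy");
	return 0;
}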

Currently this latency target check is only for SSDs, as we can't
calculate the latency target for hard disks. And this is only for cgroup
leaf nodes so far.

Signed-off-by: Shaohua Li <shli@fb.com>
---
 block/blk-throttle.c      | 58 ++++++++++++++++++++++++++++++++++++++++++++---
 include/linux/blk_types.h |  1 +
 2 files changed, 56 insertions(+), 3 deletions(-)

Comments

Tejun Heo Nov. 29, 2016, 5:31 p.m. UTC | #1
Hello,

On Mon, Nov 14, 2016 at 02:22:22PM -0800, Shaohua Li wrote:
> One hard problem in adding the .high limit is detecting idle cgroups. If one
> cgroup doesn't dispatch enough IO against its high limit, we must have a
> mechanism to determine whether other cgroups can dispatch more IO. We added
> the think time detection mechanism before, but it doesn't work for all
> workloads. Here we add a latency based approach.

As I wrote before, I think that the two mechanisms should operate on
two mostly separate aspects of io control - latency control for
arbitrating active cgroups and idle detection to count out cgroups
which are sitting doing nothing - instead of the two mechanisms
possibly competing.

>  static bool throtl_tg_is_idle(struct throtl_grp *tg)
>  {
> -	/* cgroup is idle if average think time is more than threshold */
> -	return ktime_get_ns() - tg->last_finish_time >
> +	/*
> +	 * cgroup is idle if:
> +	 * 1. average think time is higher than threshold
> +	 * 2. average request size is small and average latency is higher
                                                                   ^
								   lower, right?
> +	 *    than target
> +	 */

So, this looks like too much magic to me.  How would one configure for
a workload which may issue small IOs, say, every few seconds but
requires low latency?

Thanks.
Shaohua Li Nov. 29, 2016, 6:14 p.m. UTC | #2
On Tue, Nov 29, 2016 at 12:31:08PM -0500, Tejun Heo wrote:
> Hello,
> 
> On Mon, Nov 14, 2016 at 02:22:22PM -0800, Shaohua Li wrote:
> > One hard problem in adding the .high limit is detecting idle cgroups. If one
> > cgroup doesn't dispatch enough IO against its high limit, we must have a
> > mechanism to determine whether other cgroups can dispatch more IO. We added
> > the think time detection mechanism before, but it doesn't work for all
> > workloads. Here we add a latency based approach.
> 
> As I wrote before, I think that the two mechanisms should operate on
> two mostly separate aspects of io control - latency control for
> arbitrating active cgroups and idle detection to count out cgroups
> which are sitting doing nothing - instead of the two mechanisms
> possibly competing.

What the patches do doesn't conflict with what you are talking about. We need a way
to detect if cgroups are idle or active. I think the problem is how to define
'active' and 'idle'. We must quantify the state. We could use:
1. plain idle detection
2. think time idle detection

1 is a subset of 2. Both need a knob to specify the time. 2 is more generic.
Probably the function name 'throtl_tg_is_idle' is misleading. It really means
'the cgroup's high limit can be ignored, other cgroups can dispatch more IO'
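
For illustration, a minimal userspace sketch of the two schemes (the names are
hypothetical; the averaging mirrors the 7/8-old, 1/8-new weighting the patch
uses for latency and request size):

#include <stdbool.h>
#include <stdint.h>

/*
 * Illustrative only.  Plain idle detection: no IO has completed for longer
 * than the configured idle time.  Think time detection: the average gap
 * between one IO completing and the next being issued exceeds the same
 * threshold, so a cgroup that merely trickles IO can also be counted out.
 */
struct cg_state {
	uint64_t last_finish_time_ns;	/* completion time of the last IO */
	uint64_t avg_ttime_ns;		/* average of (submit - last finish) */
};

bool plain_idle(const struct cg_state *cg, uint64_t now_ns,
		uint64_t idle_time_ns)
{
	return now_ns - cg->last_finish_time_ns > idle_time_ns;
}

bool think_time_idle(const struct cg_state *cg, uint64_t now_ns,
		     uint64_t idle_time_ns)
{
	/* plain idle is a subset: a long quiet period also counts */
	return plain_idle(cg, now_ns, idle_time_ns) ||
	       cg->avg_ttime_ns > idle_time_ns;
}

/* called on each IO submission; history weighted 7/8, new sample 1/8 */
void update_think_time(struct cg_state *cg, uint64_t submit_ns)
{
	uint64_t ttime = submit_ns - cg->last_finish_time_ns;

	cg->avg_ttime_ns = (cg->avg_ttime_ns * 7 + ttime) >> 3;
}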
 
> >  static bool throtl_tg_is_idle(struct throtl_grp *tg)
> >  {
> > -	/* cgroup is idle if average think time is more than threshold */
> > -	return ktime_get_ns() - tg->last_finish_time >
> > +	/*
> > +	 * cgroup is idle if:
> > +	 * 1. average think time is higher than threshold
> > +	 * 2. average request size is small and average latency is higher
>                                                                    ^
> 								   lower, right?
oh, yes

> > +	 *    than target
> > +	 */
> 
> So, this looks like too much magic to me.  How would one configure for
> a workload which may issue small IOs, say, every few seconds but
> requires low latency?

configure the think time threshold to several seconds and configure the latency
target; it should do the job.
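
As a toy calculation with made-up numbers (mirroring, in simplified form, the
OR of the think time check and the latency check in throtl_tg_is_idle() below):
a cgroup issuing one small IO every ~3 seconds is not written off by a 5 second
think time threshold alone, and its high limit only becomes ignorable while the
measured latency stays under the configured target:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

int main(void)
{
	uint64_t avg_ttime_ns   = 3000000000ull;	/* ~3s between IOs */
	uint64_t threshold_ns   = 5000000000ull;	/* knob: 5s        */
	uint64_t target_ns      = 200000;		/* knob: 200us     */
	uint64_t avg_latency_ns = 150000;		/* measured: 150us */

	/* the cgroup's high limit may be ignored if either check passes */
	bool ignorable = avg_ttime_ns > threshold_ns ||
			 avg_latency_ns < target_ns;

	printf("high limit can be ignored: %s\n", ignorable ? "yes" : "no");
	return 0;
}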

Thanks,
Shaohua
Tejun Heo Nov. 29, 2016, 10:54 p.m. UTC | #3
Hello,

On Tue, Nov 29, 2016 at 10:14:03AM -0800, Shaohua Li wrote:
> What the patches do doesn't conflict with what you are talking about. We need a way
> to detect if cgroups are idle or active. I think the problem is how to define
> 'active' and 'idle'. We must quantify the state. We could use:
> 1. plain idle detection
> 2. think time idle detection
> 
> 1 is a subset of 2. Both need a knob to specify the time. 2 is more generic.
> Probably the function name 'throtl_tg_is_idle' is misleading. It really means
> 'the cgroup's high limit can be ignored, other cgroups can dispatch more IO'

Yeah, both work towards about the same goal.  I feel a bit icky about
using thinktime as it seems more complicated than called for here.

> > >  static bool throtl_tg_is_idle(struct throtl_grp *tg)
> > >  {
> > > -	/* cgroup is idle if average think time is more than threshold */
> > > -	return ktime_get_ns() - tg->last_finish_time >
> > > +	/*
> > > +	 * cgroup is idle if:
> > > +	 * 1. average think time is higher than threshold
> > > +	 * 2. average request size is small and average latency is higher
> >                                                                    ^
> > 								   lower, right?
> oh, yes
> 
> > > +	 *    than target
> > > +	 */
> > 
> > So, this looks like too much magic to me.  How would one configure for
> > a workload which may issue small IOs, say, every few seconds but
> > requires low latency?
> 
> configure the think time threshold to several seconds and configure the latency
> target; it should do the job.

Sure, with a high enough number, it'd do the same thing but it's a
fuzzy number which can be difficult to tell from the user's point of view.
Implementation-wise, this isn't a huge difference but I'm worried that
this can fall into the trap of "this isn't doing what I'm expecting it
to" - "try to nudge that number a bit" situation.

If we have a latency target and a dumb idle setting, each one's role is
clear - the latency target determines the guarantee that we want to give
to that cgroup and accordingly how much utilization we're willing to
sacrifice for that, and the idle period tells us to ignore the cgroup if
it's idle for a relatively long term.  The distinction between the two
knobs is fairly clear.

With thinktime, the roles of each knob seem more muddled in that
thinktime would be a knob which can also be used to fine-tune
not-too-active sharing.

Most of our differences might be coming from where we assign
importance.  I think that if a cgroup wants to have latency target, it
should be the primary parameter and followed as strictly and clearly
as possible even if that means lower overall utilization.  If a cgroup
issues IOs sporadically and thinktime can increase utilization
(compared to dumb idle detection), that means that the cgroup wouldn't
be getting the target latency that it configured.  If such a situation
is acceptable, wouldn't it make sense to lower the target latency
instead?

Thanks.
Shaohua Li Nov. 29, 2016, 11:39 p.m. UTC | #4
On Tue, Nov 29, 2016 at 05:54:46PM -0500, Tejun Heo wrote:
> Hello,
> 
> On Tue, Nov 29, 2016 at 10:14:03AM -0800, Shaohua Li wrote:
> > What the patches do doesn't conflict with what you are talking about. We need a way
> > to detect if cgroups are idle or active. I think the problem is how to define
> > 'active' and 'idle'. We must quantify the state. We could use:
> > 1. plain idle detection
> > 2. think time idle detection
> > 
> > 1 is a subset of 2. Both need a knob to specify the time. 2 is more generic.
> > Probably the function name 'throtl_tg_is_idle' is misleading. It really means
> > 'the cgroup's high limit can be ignored, other cgroups can dispatch more IO'
> 
> Yeah, both work towards about the same goal.  I feel a bit icky about
> using thinktime as it seems more complicated than called for here.
> 
> > > >  static bool throtl_tg_is_idle(struct throtl_grp *tg)
> > > >  {
> > > > -	/* cgroup is idle if average think time is more than threshold */
> > > > -	return ktime_get_ns() - tg->last_finish_time >
> > > > +	/*
> > > > +	 * cgroup is idle if:
> > > > +	 * 1. average think time is higher than threshold
> > > > +	 * 2. average request size is small and average latency is higher
> > >                                                                    ^
> > > 								   lower, right?
> > oh, yes
> > 
> > > > +	 *    than target
> > > > +	 */
> > > 
> > > So, this looks like too much magic to me.  How would one configure for
> > > a workload which may issue small IOs, say, every few seconds but
> > > requires low latency?
> > 
> > configure the think time threshold to several seconds and configure the latency
> > target; it should do the job.
> 
> Sure, with a high enough number, it'd do the same thing but it's a
> fuzzy number which can be difficult to tell from the user's point of view.
> Implementation-wise, this isn't a huge difference but I'm worried that
> this can fall into the trap of "this isn't doing what I'm expecting it
> to" - "try to nudge that number a bit" situation.
> 
> If we have a latency target and a dumb idle setting, each one's role is
> clear - the latency target determines the guarantee that we want to give
> to that cgroup and accordingly how much utilization we're willing to
> sacrifice for that, and the idle period tells us to ignore the cgroup if
> it's idle for a relatively long term.  The distinction between the two
> knobs is fairly clear.
> 
> With thinktime, the roles of each knob seem more muddled in that
> thinktime would be a knob which can also be used to fine-tune
> not-too-active sharing.

The dumb idle or think time idle question is about implementation choice. Let me
put it this way: define a knob called 'idle_time'. In the first implementation we
implement the knob as dumb idle; later we implement it as think time idle.
Would this make you feel better? Or does just using the new name 'idle_time'
already make you happy?

For dumb idle, we probably can't let the user configure the 'idle_time' too small
though.

> Most of our differences might be coming from where we assign
> importance.  I think that if a cgroup wants to have latency target, it
> should be the primary parameter and followed as strictly and clearly
> as possible even if that means lower overall utilization.  If a cgroup
> issues IOs sporadically and thinktime can increase utilization
> (compared to dumb idle detection), that means that the cgroup wouldn't
> be getting the target latency that it configured.  If such a situation
> is acceptable, wouldn't it make sense to lower the target latency
> instead?

Lowering the target latency doesn't really help. Within a given latency target, a
cgroup can dispatch 1 IO per second or 1000 IOs per second. The reality is that
whether the application stops dispatching IO (idle) and whether the application's
IO latency is high have no relationship to each other.

Thanks,
Shaohua

Patch

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index ac4d9ea..d07f332 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -156,6 +156,12 @@  struct throtl_grp {
 	u64 last_finish_time;
 	u64 checked_last_finish_time;
 	u64 avg_ttime;
+
+	unsigned int bio_batch;
+	u64 total_latency;
+	u64 avg_latency;
+	u64 total_size;
+	u64 avg_size;
 };
 
 /* We measure latency for request size from 4k to 4k * ( 1 << 4) */
@@ -1734,12 +1740,30 @@  static unsigned long tg_last_high_overflow_time(struct throtl_grp *tg)
 	return ret;
 }
 
+static u64 throtl_target_latency(struct throtl_data *td,
+	struct throtl_grp *tg)
+{
+	if (td->line_slope == 0 || tg->latency_target == 0)
+		return 0;
+
+	/* latency_target + f(avg_size) - f(4k) */
+	return td->line_slope * ((tg->avg_size >> 10) - 4) +
+		tg->latency_target;
+}
+
 static bool throtl_tg_is_idle(struct throtl_grp *tg)
 {
-	/* cgroup is idle if average think time is more than threshold */
-	return ktime_get_ns() - tg->last_finish_time >
+	/*
+	 * cgroup is idle if:
+	 * 1. average think time is higher than threshold
+	 * 2. average request size is small and average latency is higher
+	 *    than target
+	 */
+	return (ktime_get_ns() - tg->last_finish_time >
 		4 * tg->td->idle_ttime_threshold ||
-	       tg->avg_ttime > tg->td->idle_ttime_threshold;
+		tg->avg_ttime > tg->td->idle_ttime_threshold) ||
+	       (tg->avg_latency && tg->avg_size && tg->avg_size <= 128 * 1024 &&
+		tg->avg_latency < throtl_target_latency(tg->td, tg));
 }
 
 static bool throtl_upgrade_check_one(struct throtl_grp *tg)
@@ -2123,6 +2147,7 @@  bool blk_throtl_bio(struct request_queue *q, struct blkcg_gq *blkg,
 	bio_associate_current(bio);
 	bio->bi_cg_private = q;
 	bio->bi_cg_size = bio_sectors(bio);
+	bio->bi_cg_enter_time = ktime_get_ns();
 
 	blk_throtl_update_ttime(tg);
 
@@ -2264,6 +2289,33 @@  void blk_throtl_bio_endio(struct bio *bio)
 		}
 	}
 
+	if (bio->bi_cg_enter_time && finish_time > bio->bi_cg_enter_time &&
+	    tg->latency_target) {
+		lat = finish_time - bio->bi_cg_enter_time;
+		tg->total_latency += lat;
+		tg->total_size += bio->bi_cg_size << 9;
+		tg->bio_batch++;
+	}
+
+	if (tg->bio_batch >= 8) {
+		int batch = tg->bio_batch;
+		u64 size = tg->total_size;
+
+		lat = tg->total_latency;
+
+		tg->bio_batch = 0;
+		tg->total_latency = 0;
+		tg->total_size = 0;
+
+		if (batch) {
+			do_div(lat, batch);
+			tg->avg_latency = (tg->avg_latency * 7 +
+				lat) >> 3;
+			do_div(size, batch);
+			tg->avg_size = (tg->avg_size * 7 + size) >> 3;
+		}
+	}
+
 end:
 	rcu_read_unlock();
 }
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 45bb437..fe87a20 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -61,6 +61,7 @@  struct bio {
 	struct cgroup_subsys_state *bi_css;
 	void *bi_cg_private;
 	u64 bi_cg_issue_time;
+	u64 bi_cg_enter_time;
 	sector_t bi_cg_size;
 #endif
 	union {