[V4,10/15] blk-throttle: add a simple idle detection

Message ID ba2d677b381e94a2f6c4bf5108f4906c78e99d4f.1479161136.git.shli@fb.com (mailing list archive)
State New, archived

Commit Message

Shaohua Li Nov. 14, 2016, 10:22 p.m. UTC
A cgroup may be assigned a high limit but never dispatch enough IO to
cross that limit. In such a case, the queue state machine will remain in
the LIMIT_HIGH state and all other cgroups will be throttled according
to their high limits. This is unfair to the other cgroups. We should
treat such a cgroup as idle and upgrade the state machine to a higher
state.

We also have downgrade logic. If the state machine upgrades because a
cgroup is idle (really idle), it will soon downgrade again because the
cgroup stays below its high limit. This isn't what we want. A more
complicated case is a cgroup that isn't idle while the queue is in
LIMIT_HIGH, but once the queue is upgraded to a higher state, other
cgroups dispatch more IO and this cgroup can't dispatch enough, so it
falls below its high limit and looks idle (fake idle). In this case, the
queue should downgrade soon. The key to deciding whether to downgrade is
detecting whether the cgroup is truly idle.

Unfortunately it's very hard to determine whether a cgroup is really
idle. This patch uses the 'think time check' idea from CFQ for that
purpose. Please note that the idea doesn't work for all workloads. For
example, a workload with io depth 8 keeps the disk 100% utilized, hence
its think time is 0, i.e., not idle; yet the same workload could achieve
higher bandwidth with io depth 16. Compared to the io depth 16 case, the
io depth 8 workload is idle. We use the idea only to roughly determine
whether a cgroup is idle.

We treat a cgroup as idle if its think time is above a threshold (by
default 50us for SSD and 1ms for HD). The idea is that think time above
the threshold starts to harm performance. HD is much slower, so a longer
think time is acceptable.
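
Roughly, in stand-alone C (the names below are made up for illustration,
not the kernel interfaces), the check keeps an exponentially weighted
average of the gap between an IO's completion and the next IO's
submission from the same cgroup, and compares it against the threshold:

#include <stdbool.h>
#include <stdint.h>

#define IDLE_THRESHOLD_SSD_NS	(50ULL * 1000)		/* 50 us */
#define IDLE_THRESHOLD_HD_NS	(1000ULL * 1000)	/* 1 ms */

struct grp_sample {
	uint64_t last_finish_time;	/* completion time of the previous IO */
	uint64_t avg_ttime;		/* EWMA of think time, in ns */
};

/* Called on each newly submitted IO: fold in the new think-time sample. */
static void update_think_time(struct grp_sample *g, uint64_t now)
{
	if (g->last_finish_time && now > g->last_finish_time) {
		uint64_t sample = now - g->last_finish_time;

		/* same 7/8 old + 1/8 new weighting the patch uses */
		g->avg_ttime = (g->avg_ttime * 7 + sample) >> 3;
	}
}

/* A group counts as idle if it hasn't completed IO for a long time or
 * its average think time exceeds the threshold. */
static bool group_is_idle(const struct grp_sample *g, uint64_t now,
			  uint64_t threshold_ns)
{
	return now - g->last_finish_time > 4 * threshold_ns ||
	       g->avg_ttime > threshold_ns;
}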

Signed-off-by: Shaohua Li <shli@fb.com>
---
 block/bio.c               |  2 ++
 block/blk-throttle.c      | 72 ++++++++++++++++++++++++++++++++++++++++++++++-
 block/blk.h               |  2 ++
 include/linux/blk_types.h |  1 +
 4 files changed, 76 insertions(+), 1 deletion(-)

Comments

Tejun Heo Nov. 23, 2016, 9:46 p.m. UTC | #1
Hello, Shaohua.

On Mon, Nov 14, 2016 at 02:22:17PM -0800, Shaohua Li wrote:
> Unfortunately it's very hard to determine if a cgroup is real idle. This
> patch uses the 'think time check' idea from CFQ for the purpose. Please
> note, the idea doesn't work for all workloads. For example, a workload
> with io depth 8 has disk utilization 100%, hence think time is 0, eg,
> not idle. But the workload can run higher bandwidth with io depth 16.
> Compared to io depth 16, the io depth 8 workload is idle. We use the
> idea to roughly determine if a cgroup is idle.

Hmm... I'm not sure thinktime is the best measure here.  Think time is
used by cfq mainly to tell the likely future behavior of a workload so
that cfq can take speculative actions on the prediction.  However,
given that the implemented high limit behavior tries to provide a
certain latency target, using the predictive thinktime to regulate
behavior might lead to overly unpredictable behavior.

Moreover, I don't see why we need to bother with predictions anyway.
cfq needed it but I don't think that's the case for blk-throtl.  It
can just provide an idle threshold where a cgroup which hasn't issued
an IO for longer than that threshold is considered idle.  That'd be a
lot easier to
understand and configure from userland while providing a good enough
mechanism to prevent idle cgroups from clamping down utilization for
too long.
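
(Concretely, the plain check I have in mind is nothing more involved
than the sketch below; the names are made up, just to show how simple
it would be.)

#include <stdbool.h>
#include <stdint.h>

/* Hypothetical sketch: idle == no IO issued for idle_threshold_ns. */
static bool tg_is_idle_simple(uint64_t last_dispatch_time, uint64_t now,
			      uint64_t idle_threshold_ns)
{
	return now - last_dispatch_time > idle_threshold_ns;
}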

Thanks.
Shaohua Li Nov. 24, 2016, 1:15 a.m. UTC | #2
On Wed, Nov 23, 2016 at 04:46:19PM -0500, Tejun Heo wrote:
> Hello, Shaohua.
> 
> On Mon, Nov 14, 2016 at 02:22:17PM -0800, Shaohua Li wrote:
> > Unfortunately it's very hard to determine if a cgroup is real idle. This
> > patch uses the 'think time check' idea from CFQ for the purpose. Please
> > note, the idea doesn't work for all workloads. For example, a workload
> > with io depth 8 has disk utilization 100%, hence think time is 0, eg,
> > not idle. But the workload can run higher bandwidth with io depth 16.
> > Compared to io depth 16, the io depth 8 workload is idle. We use the
> > idea to roughly determine if a cgroup is idle.
> 
> Hmm... I'm not sure thinktime is the best measure here.  Think time is
> used by cfq mainly to tell the likely future behavior of a workload so
> that cfq can take speculative actions on the prediction.  However,
> given that the implemented high limit behavior tries to provide a
> certain level of latency target, using the predictive thinktime to
> regulate behavior might lead to too unpredictable behaviors.

Latency just reflects one side of the IO. Latency and think time don't
have any direct relationship. For example, a cgroup dispatching 1 IO per
second can still see high latency. If we only take latency into account,
we will think the cgroup is busy, which is not justified.
 
> Moreover, I don't see why we need to bother with predictions anyway.
> cfq needed it but I don't think that's the case for blk-throtl.  It
> can just provide idle threshold where a cgroup which hasn't issued an
> IO over that threshold is considered idle.  That'd be a lot easier to
> understand and configure from userland while providing a good enough
> mechanism to prevent idle cgroups from clamping down utilization for
> too long.

We could do this, but it will only work for a very idle workload, i.e.,
one that is completely idle. If the workload dispatches IO sporadically,
this will likely not work. The average think time is more precise for
prediction.

Thanks,
Shaohua
Tejun Heo Nov. 28, 2016, 10:21 p.m. UTC | #3
Hello, Shaohua.

On Wed, Nov 23, 2016 at 05:15:18PM -0800, Shaohua Li wrote:
> > Hmm... I'm not sure thinktime is the best measure here.  Think time is
> > used by cfq mainly to tell the likely future behavior of a workload so
> > that cfq can take speculative actions on the prediction.  However,
> > given that the implemented high limit behavior tries to provide a
> > certain level of latency target, using the predictive thinktime to
> > regulate behavior might lead to too unpredictable behaviors.
> 
> Latency just reflects one side of the IO. Latency and think time haven't any
> relationship. For example, a cgroup dispatching 1 IO per second can still have
> high latency. If we only take latency account, we will think the cgroup is
> busy, which is not justified.

Yes, the two are independent metrics; however, whether a cgroup is
considered idle or not affects whether blk-throttle will adhere to the
latency target or not.  Thinktime is a magic number which can be good
but whose behavior can be very difficult to predict from outside the
black box.  What I was trying to say was that putting in thinktime
here can greatly weaken the configured latency target in unobvious
ways.

> > Moreover, I don't see why we need to bother with predictions anyway.
> > cfq needed it but I don't think that's the case for blk-throtl.  It
> > can just provide idle threshold where a cgroup which hasn't issued an
> > IO over that threshold is considered idle.  That'd be a lot easier to
> > understand and configure from userland while providing a good enough
> > mechanism to prevent idle cgroups from clamping down utilization for
> > too long.
> 
> We could do this, but it will only work for very idle workload, eg, the
> workload is completely idle. If workload dispatches IO sporadically, this will
> likely not work. The average think time is more precise for predication.

But we can increase sharing by upping the target latency.  That should
be the main knob - if low, the user wants stricter service guarantee
at the cost of lower overall utilization; if high, the workload can
deal with higher latency and the system can achieve higher overall
utilization.  I think the idle detection should be an extra mechanism
which can be used to ignore cgroup-disk combinations which are staying
idle for a long time.

Thanks.
Shaohua Li Nov. 28, 2016, 11:10 p.m. UTC | #4
On Mon, Nov 28, 2016 at 05:21:48PM -0500, Tejun Heo wrote:
> Hello, Shaohua.
> 
> On Wed, Nov 23, 2016 at 05:15:18PM -0800, Shaohua Li wrote:
> > > Hmm... I'm not sure thinktime is the best measure here.  Think time is
> > > used by cfq mainly to tell the likely future behavior of a workload so
> > > that cfq can take speculative actions on the prediction.  However,
> > > given that the implemented high limit behavior tries to provide a
> > > certain level of latency target, using the predictive thinktime to
> > > regulate behavior might lead to too unpredictable behaviors.
> > 
> > Latency just reflects one side of the IO. Latency and think time haven't any
> > relationship. For example, a cgroup dispatching 1 IO per second can still have
> > high latency. If we only take latency account, we will think the cgroup is
> > busy, which is not justified.
> 
> Yes, the two are indepndent metrics; however, whether a cgroup is
> considered idle or not affects whether blk-throttle will adhere to the
> latency target or not.  Thinktime is a magic number which can be good
> but whose behavior can be very difficult to predict from outside the
> black box.  What I was trying to say was that putting in thinktime
> here can greatly weaken the configured latency target in unobvious
> ways.
> 
> > > Moreover, I don't see why we need to bother with predictions anyway.
> > > cfq needed it but I don't think that's the case for blk-throtl.  It
> > > can just provide idle threshold where a cgroup which hasn't issued an
> > > IO over that threshold is considered idle.  That'd be a lot easier to
> > > understand and configure from userland while providing a good enough
> > > mechanism to prevent idle cgroups from clamping down utilization for
> > > too long.
> > 
> > We could do this, but it will only work for very idle workload, eg, the
> > workload is completely idle. If workload dispatches IO sporadically, this will
> > likely not work. The average think time is more precise for predication.
> 
> But we can increase sharing by upping the target latency.  That should
> be the main knob - if low, the user wants stricter service guarantee
> at the cost of lower overall utilization; if high, the workload can
> deal with higher latency and the system can achieve higher overall
> utilization.  I think the idle detection should be an extra mechanism
> which can be used to ignore cgroup-disk combinations which are staying
> idle for a long time.

Yes, we can increase the target latency to increase sharing. But latency
and think time are different. In the example I mentioned earlier, we
would have to raise the latency target a lot to increase sharing even
though the cgroup sends just 1 IO per second. I don't think that's what
users want. In summary, we can't use latency alone to determine whether
cgroups could dispatch more IO.

Currently the think time idle detection is an extra mechanism to ignore
a cgroup's limit. So currently we only ignore the cgroup limit when the
think time is big or the latency is small. This does make the behavior a
little harder to predict, e.g., the latency target is sometimes not
respected, but that is necessary to get better sharing.
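
(As a rough sketch, the upgrade decision combines the two signals like
this; illustrative names only, not the exact code in the series.)

#include <stdbool.h>
#include <stdint.h>

/* Illustrative: a cgroup's high limit is ignored (upgrade allowed) when
 * the cgroup looks idle by think time OR its measured latency is still
 * within the configured target. */
static bool can_ignore_high_limit(uint64_t avg_ttime, uint64_t ttime_threshold,
				  uint64_t latency, uint64_t latency_target)
{
	return avg_ttime > ttime_threshold || latency <= latency_target;
}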

Thanks,
Shaohua
Tejun Heo Nov. 29, 2016, 5:08 p.m. UTC | #5
Hello, Shaohua.

On Mon, Nov 28, 2016 at 03:10:18PM -0800, Shaohua Li wrote:
> > But we can increase sharing by upping the target latency.  That should
> > be the main knob - if low, the user wants stricter service guarantee
> > at the cost of lower overall utilization; if high, the workload can
> > deal with higher latency and the system can achieve higher overall
> > utilization.  I think the idle detection should be an extra mechanism
> > which can be used to ignore cgroup-disk combinations which are staying
> > idle for a long time.
> 
> Yes, we can increase target latency to increase sharing. But latency and think
> time are different. In the example I mentioned earlier, we must increase the
> latency target very big to increase sharing even the cgroup just sends 1 IO per
> second. Don't think this's what users want. In a summary, we can't only use
> latency to determine if cgroups could dispatch more IO.
> 
> Currently the think time idle detection is an extra mechanism to ignore cgroup
> limit. So we currently we only ignore cgroup limit when think time is big or
> latency is small. This does make the behavior a little bit difficult to
> predict, eg, not respect latency target sometimes, but this is necessary to
> have better sharing.

So, it's not like we can get better sharing for free.  It always comes
at the cost of the (best-effort) latency guarantee.  Using thinktime for
idle detection doesn't mean that we get higher utilization for free.
If we get higher utilization by using thinktime instead of plain idle
detection, it means that we're sacrificing the latency guarantee more
with thinktime, so I don't think the argument that using thinktime leads
to higher utilization is a clear winner.

That is not to say that there's no benefit to thinktime.  I can
imagine cases where it'd allow us to ride the line between acceptable
latency and good overall utilization better; however, that also comes
with cases where one has to wonder "what's going on?  I have no idea
what it's doing".

Given that blk-throttle is gonna ask for explicit and detailed
configuration from its users, I think it's vital that it has config
knobs which are immediately clear.  Being tedious is already a burden
and I don't think adding unpredictability there is a good idea.

Thanks.

Patch

diff --git a/block/bio.c b/block/bio.c
index db85c57..7baa86d 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -30,6 +30,7 @@ 
 #include <linux/cgroup.h>
 
 #include <trace/events/block.h>
+#include "blk.h"
 
 /*
  * Test patch to inline a certain number of bi_io_vec's inside the bio
@@ -1759,6 +1760,7 @@  void bio_endio(struct bio *bio)
 		goto again;
 	}
 
+	blk_throtl_bio_endio(bio);
 	if (bio->bi_end_io)
 		bio->bi_end_io(bio);
 }
diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index 45a28c4..cb5fd85 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -21,6 +21,8 @@  static int throtl_quantum = 32;
 /* Throttling is performed over 100ms slice and after that slice is renewed */
 #define DFL_THROTL_SLICE (HZ / 10)
 #define MAX_THROTL_SLICE (HZ / 5)
+#define DFL_IDLE_THRESHOLD_SSD (50 * 1000) /* 50 us */
+#define DFL_IDLE_THRESHOLD_HD (1000 * 1000) /* 1 ms */
 
 static struct blkcg_policy blkcg_policy_throtl;
 
@@ -149,6 +151,10 @@  struct throtl_grp {
 	/* When did we start a new slice */
 	unsigned long slice_start[2];
 	unsigned long slice_end[2];
+
+	u64 last_finish_time;
+	u64 checked_last_finish_time;
+	u64 avg_ttime;
 };
 
 struct throtl_data
@@ -172,6 +178,8 @@  struct throtl_data
 	unsigned long high_downgrade_time;
 
 	unsigned int scale;
+
+	u64 idle_ttime_threshold;
 };
 
 static void throtl_pending_timer_fn(unsigned long arg);
@@ -1629,6 +1637,14 @@  static unsigned long tg_last_high_overflow_time(struct throtl_grp *tg)
 	return ret;
 }
 
+static bool throtl_tg_is_idle(struct throtl_grp *tg)
+{
+	/* cgroup is idle if average think time is more than threshold */
+	return ktime_get_ns() - tg->last_finish_time >
+		4 * tg->td->idle_ttime_threshold ||
+	       tg->avg_ttime > tg->td->idle_ttime_threshold;
+}
+
 static bool throtl_upgrade_check_one(struct throtl_grp *tg)
 {
 	struct throtl_service_queue *sq = &tg->service_queue;
@@ -1837,6 +1853,19 @@  static void throtl_downgrade_check(struct throtl_grp *tg)
 	tg->last_io_disp[WRITE] = 0;
 }
 
+static void blk_throtl_update_ttime(struct throtl_grp *tg)
+{
+	u64 now = ktime_get_ns();
+	u64 last_finish_time = tg->last_finish_time;
+
+	if (now <= last_finish_time || last_finish_time == 0 ||
+	    last_finish_time == tg->checked_last_finish_time)
+		return;
+
+	tg->avg_ttime = (tg->avg_ttime * 7 + now - last_finish_time) >> 3;
+	tg->checked_last_finish_time = last_finish_time;
+}
+
 bool blk_throtl_bio(struct request_queue *q, struct blkcg_gq *blkg,
 		    struct bio *bio)
 {
@@ -1848,6 +1877,13 @@  bool blk_throtl_bio(struct request_queue *q, struct blkcg_gq *blkg,
 
 	WARN_ON_ONCE(!rcu_read_lock_held());
 
+	if (tg->td->idle_ttime_threshold == -1) {
+		if (blk_queue_nonrot(q))
+			tg->td->idle_ttime_threshold = DFL_IDLE_THRESHOLD_SSD;
+		else
+			tg->td->idle_ttime_threshold = DFL_IDLE_THRESHOLD_HD;
+	}
+
 	/* see throtl_charge_bio() */
 	if ((bio->bi_opf & REQ_THROTTLED) || !tg->has_rules[rw])
 		goto out;
@@ -1857,6 +1893,11 @@  bool blk_throtl_bio(struct request_queue *q, struct blkcg_gq *blkg,
 	if (unlikely(blk_queue_bypass(q)))
 		goto out_unlock;
 
+	bio_associate_current(bio);
+	bio->bi_cg_private = q;
+
+	blk_throtl_update_ttime(tg);
+
 	sq = &tg->service_queue;
 
 again:
@@ -1917,7 +1958,6 @@  bool blk_throtl_bio(struct request_queue *q, struct blkcg_gq *blkg,
 
 	tg->last_high_overflow_time[rw] = jiffies;
 
-	bio_associate_current(bio);
 	tg->td->nr_queued[rw]++;
 	throtl_add_bio_tg(bio, qn, tg);
 	throttled = true;
@@ -1946,6 +1986,34 @@  bool blk_throtl_bio(struct request_queue *q, struct blkcg_gq *blkg,
 	return throttled;
 }
 
+void blk_throtl_bio_endio(struct bio *bio)
+{
+	struct blkcg *blkcg;
+	struct blkcg_gq *blkg;
+	struct throtl_grp *tg;
+	struct request_queue *q;
+
+	q = bio->bi_cg_private;
+	if (!q)
+		return;
+	bio->bi_cg_private = NULL;
+
+	rcu_read_lock();
+	blkcg = bio_blkcg(bio);
+	if (!blkcg)
+		goto end;
+	blkg = blkg_lookup(blkcg, q);
+	if (!blkg)
+		goto end;
+
+	tg = blkg_to_tg(blkg ?: q->root_blkg);
+
+	tg->last_finish_time = ktime_get_ns();
+
+end:
+	rcu_read_unlock();
+}
+
 /*
  * Dispatch all bios from all children tg's queued on @parent_sq.  On
  * return, @parent_sq is guaranteed to not have any active children tg's
@@ -2030,6 +2098,8 @@  int blk_throtl_init(struct request_queue *q)
 	td->limit_index = LIMIT_MAX;
 	td->high_upgrade_time = jiffies;
 	td->high_downgrade_time = jiffies;
+
+	td->idle_ttime_threshold = -1;
 	/* activate policy */
 	ret = blkcg_activate_policy(q, &blkcg_policy_throtl);
 	if (ret)
diff --git a/block/blk.h b/block/blk.h
index 39c14dd..b433f35 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -292,10 +292,12 @@  extern void blk_throtl_exit(struct request_queue *q);
 extern ssize_t blk_throtl_slice_show(struct request_queue *q, char *page);
 extern ssize_t blk_throtl_slice_store(struct request_queue *q,
 	const char *page, size_t count);
+extern void blk_throtl_bio_endio(struct bio *bio);
 #else /* CONFIG_BLK_DEV_THROTTLING */
 static inline void blk_throtl_drain(struct request_queue *q) { }
 static inline int blk_throtl_init(struct request_queue *q) { return 0; }
 static inline void blk_throtl_exit(struct request_queue *q) { }
+static inline void blk_throtl_bio_endio(struct bio *bio) { }
 #endif /* CONFIG_BLK_DEV_THROTTLING */
 
 #endif /* BLK_INTERNAL_H */
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index cd395ec..ff8dd24 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -59,6 +59,7 @@  struct bio {
 	 */
 	struct io_context	*bi_ioc;
 	struct cgroup_subsys_state *bi_css;
+	void *bi_cg_private;
 #endif
 	union {
 #if defined(CONFIG_BLK_DEV_INTEGRITY)