| Message ID | ba2d677b381e94a2f6c4bf5108f4906c78e99d4f.1479161136.git.shli@fb.com (mailing list archive) |
| --- | --- |
| State | New, archived |
Hello, Shaohua.

On Mon, Nov 14, 2016 at 02:22:17PM -0800, Shaohua Li wrote:
> Unfortunately it's very hard to determine if a cgroup is really idle.
> This patch uses the 'think time check' idea from CFQ for the purpose.
> Please note, the idea doesn't work for all workloads. For example, a
> workload with io depth 8 has disk utilization 100%, hence think time
> is 0, i.e., not idle. But the workload can run higher bandwidth with
> io depth 16. Compared to io depth 16, the io depth 8 workload is
> idle. We use the idea to roughly determine if a cgroup is idle.

Hmm... I'm not sure thinktime is the best measure here. Think time is
used by cfq mainly to predict the likely future behavior of a workload
so that cfq can take speculative actions based on the prediction.
However, given that the implemented high limit behavior tries to
provide a certain level of latency target, using the predictive
thinktime to regulate behavior might lead to overly unpredictable
behavior.

Moreover, I don't see why we need to bother with predictions anyway.
cfq needed it, but I don't think that's the case for blk-throtl. It
can just provide an idle threshold, where a cgroup which hasn't issued
an IO over that threshold is considered idle. That'd be a lot easier
to understand and configure from userland while providing a good
enough mechanism to prevent idle cgroups from clamping down
utilization for too long.

Thanks.
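A minimal sketch of the plain idle-threshold check described above could look like the following. It is not part of the posted patch; last_dispatch_time and idle_threshold_ns are hypothetical fields used only for illustration.

/*
 * Hypothetical sketch only: a cgroup counts as idle when it has issued
 * no IO at all for longer than a configurable threshold.  Neither
 * last_dispatch_time nor idle_threshold_ns exists in the posted patch.
 */
static bool throtl_tg_is_idle_simple(struct throtl_grp *tg)
{
        /* last_dispatch_time would be stamped on every bio dispatch */
        return ktime_get_ns() - tg->last_dispatch_time >
               tg->td->idle_threshold_ns;
}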
On Wed, Nov 23, 2016 at 04:46:19PM -0500, Tejun Heo wrote:
> Hello, Shaohua.
>
> On Mon, Nov 14, 2016 at 02:22:17PM -0800, Shaohua Li wrote:
> > Unfortunately it's very hard to determine if a cgroup is really idle.
> > This patch uses the 'think time check' idea from CFQ for the purpose.
> > Please note, the idea doesn't work for all workloads. For example, a
> > workload with io depth 8 has disk utilization 100%, hence think time
> > is 0, i.e., not idle. But the workload can run higher bandwidth with
> > io depth 16. Compared to io depth 16, the io depth 8 workload is
> > idle. We use the idea to roughly determine if a cgroup is idle.
>
> Hmm... I'm not sure thinktime is the best measure here. Think time is
> used by cfq mainly to predict the likely future behavior of a workload
> so that cfq can take speculative actions based on the prediction.
> However, given that the implemented high limit behavior tries to
> provide a certain level of latency target, using the predictive
> thinktime to regulate behavior might lead to overly unpredictable
> behavior.

Latency only reflects one side of the IO; latency and think time don't
have any direct relationship. For example, a cgroup dispatching 1 IO
per second can still have high latency. If we only take latency into
account, we will think the cgroup is busy, which is not justified.

> Moreover, I don't see why we need to bother with predictions anyway.
> cfq needed it, but I don't think that's the case for blk-throtl. It
> can just provide an idle threshold, where a cgroup which hasn't issued
> an IO over that threshold is considered idle. That'd be a lot easier
> to understand and configure from userland while providing a good
> enough mechanism to prevent idle cgroups from clamping down
> utilization for too long.

We could do this, but it will only work for a very idle workload, i.e.,
one that is completely idle. If the workload dispatches IO
sporadically, this will likely not work. The average think time is more
precise for prediction.

Thanks,
Shaohua
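To make the "sporadic IO" point concrete, here is a small standalone illustration (plain userspace C, not kernel code) of the exponentially weighted average think time the patch maintains. The update rule mirrors the (avg * 7 + sample) >> 3 computation in blk_throtl_update_ttime() from the patch below, 50us is the patch's default SSD threshold, and the gap values are made-up sample data.

/*
 * Standalone illustration of the exponentially weighted average think
 * time: each new inter-IO gap contributes 1/8 of its value to the
 * running average, as in blk_throtl_update_ttime() in the patch.
 */
#include <stdio.h>
#include <stdint.h>

int main(void)
{
        /* inter-IO gaps in nanoseconds: a sporadic workload, ~200us apart */
        uint64_t gaps[] = { 200000, 180000, 220000, 190000, 210000 };
        uint64_t avg_ttime = 0;
        size_t i;

        for (i = 0; i < sizeof(gaps) / sizeof(gaps[0]); i++) {
                avg_ttime = (avg_ttime * 7 + gaps[i]) >> 3;
                printf("sample %zu: gap=%lluns avg_ttime=%lluns\n",
                       i, (unsigned long long)gaps[i],
                       (unsigned long long)avg_ttime);
        }

        /* with the 50us SSD threshold this cgroup is classified as idle
         * even though it never goes completely quiet */
        printf("idle on SSD (>50000ns)? %s\n",
               avg_ttime > 50000 ? "yes" : "no");
        return 0;
}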
Hello, Shaohua.

On Wed, Nov 23, 2016 at 05:15:18PM -0800, Shaohua Li wrote:
> > Hmm... I'm not sure thinktime is the best measure here. Think time is
> > used by cfq mainly to predict the likely future behavior of a workload
> > so that cfq can take speculative actions based on the prediction.
> > However, given that the implemented high limit behavior tries to
> > provide a certain level of latency target, using the predictive
> > thinktime to regulate behavior might lead to overly unpredictable
> > behavior.
>
> Latency only reflects one side of the IO; latency and think time don't
> have any direct relationship. For example, a cgroup dispatching 1 IO
> per second can still have high latency. If we only take latency into
> account, we will think the cgroup is busy, which is not justified.

Yes, the two are independent metrics; however, whether a cgroup is
considered idle or not affects whether blk-throttle will adhere to the
latency target or not. Thinktime is a magic number which can work well
but whose behavior can be very difficult to predict from outside the
black box. What I was trying to say was that putting thinktime in here
can greatly weaken the configured latency target in unobvious ways.

> > Moreover, I don't see why we need to bother with predictions anyway.
> > cfq needed it, but I don't think that's the case for blk-throtl. It
> > can just provide an idle threshold, where a cgroup which hasn't issued
> > an IO over that threshold is considered idle. That'd be a lot easier
> > to understand and configure from userland while providing a good
> > enough mechanism to prevent idle cgroups from clamping down
> > utilization for too long.
>
> We could do this, but it will only work for a very idle workload, i.e.,
> one that is completely idle. If the workload dispatches IO
> sporadically, this will likely not work. The average think time is more
> precise for prediction.

But we can increase sharing by upping the target latency. That should
be the main knob - if low, the user wants a stricter service guarantee
at the cost of lower overall utilization; if high, the workload can
deal with higher latency and the system can achieve higher overall
utilization. I think the idle detection should be an extra mechanism
which can be used to ignore cgroup-disk combinations which stay idle
for a long time.

Thanks.
On Mon, Nov 28, 2016 at 05:21:48PM -0500, Tejun Heo wrote:
> Hello, Shaohua.
>
> On Wed, Nov 23, 2016 at 05:15:18PM -0800, Shaohua Li wrote:
> > > Hmm... I'm not sure thinktime is the best measure here. Think time is
> > > used by cfq mainly to predict the likely future behavior of a workload
> > > so that cfq can take speculative actions based on the prediction.
> > > However, given that the implemented high limit behavior tries to
> > > provide a certain level of latency target, using the predictive
> > > thinktime to regulate behavior might lead to overly unpredictable
> > > behavior.
> >
> > Latency only reflects one side of the IO; latency and think time don't
> > have any direct relationship. For example, a cgroup dispatching 1 IO
> > per second can still have high latency. If we only take latency into
> > account, we will think the cgroup is busy, which is not justified.
>
> Yes, the two are independent metrics; however, whether a cgroup is
> considered idle or not affects whether blk-throttle will adhere to the
> latency target or not. Thinktime is a magic number which can work well
> but whose behavior can be very difficult to predict from outside the
> black box. What I was trying to say was that putting thinktime in here
> can greatly weaken the configured latency target in unobvious ways.
>
> > > Moreover, I don't see why we need to bother with predictions anyway.
> > > cfq needed it, but I don't think that's the case for blk-throtl. It
> > > can just provide an idle threshold, where a cgroup which hasn't issued
> > > an IO over that threshold is considered idle. That'd be a lot easier
> > > to understand and configure from userland while providing a good
> > > enough mechanism to prevent idle cgroups from clamping down
> > > utilization for too long.
> >
> > We could do this, but it will only work for a very idle workload, i.e.,
> > one that is completely idle. If the workload dispatches IO
> > sporadically, this will likely not work. The average think time is more
> > precise for prediction.
>
> But we can increase sharing by upping the target latency. That should
> be the main knob - if low, the user wants a stricter service guarantee
> at the cost of lower overall utilization; if high, the workload can
> deal with higher latency and the system can achieve higher overall
> utilization. I think the idle detection should be an extra mechanism
> which can be used to ignore cgroup-disk combinations which stay idle
> for a long time.

Yes, we can increase the target latency to increase sharing, but
latency and think time are different. In the example I mentioned
earlier, we would have to raise the latency target very high to
increase sharing even though the cgroup just sends 1 IO per second. I
don't think that's what users want. In summary, we can't use latency
alone to determine if cgroups could dispatch more IO.

Currently the think time idle detection is an extra mechanism used to
ignore the cgroup limit. So we only ignore the cgroup limit when the
think time is big or the latency is small. This does make the behavior
a little bit harder to predict, e.g., the latency target is sometimes
not respected, but this is necessary to get better sharing.

Thanks,
Shaohua
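Expressed as code, the combined rule described here ("think time is big or latency is small") might look roughly like the sketch below. Only avg_ttime and idle_ttime_threshold come from the posted patch; avg_latency and latency_target are hypothetical names for state a latency-aware version would have to track.

/*
 * Rough sketch of the combined check discussed above.  A cgroup's high
 * limit is ignored when its think time exceeds the idle threshold or
 * when its observed latency is within the configured target.
 * avg_latency and latency_target are illustrative names only.
 */
static bool throtl_tg_can_ignore_limit(struct throtl_grp *tg)
{
        return tg->avg_ttime > tg->td->idle_ttime_threshold ||
               tg->avg_latency <= tg->latency_target;
}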
Hello, Shaohua.

On Mon, Nov 28, 2016 at 03:10:18PM -0800, Shaohua Li wrote:
> > But we can increase sharing by upping the target latency. That should
> > be the main knob - if low, the user wants a stricter service guarantee
> > at the cost of lower overall utilization; if high, the workload can
> > deal with higher latency and the system can achieve higher overall
> > utilization. I think the idle detection should be an extra mechanism
> > which can be used to ignore cgroup-disk combinations which stay idle
> > for a long time.
>
> Yes, we can increase the target latency to increase sharing, but
> latency and think time are different. In the example I mentioned
> earlier, we would have to raise the latency target very high to
> increase sharing even though the cgroup just sends 1 IO per second. I
> don't think that's what users want. In summary, we can't use latency
> alone to determine if cgroups could dispatch more IO.
>
> Currently the think time idle detection is an extra mechanism used to
> ignore the cgroup limit. So we only ignore the cgroup limit when the
> think time is big or the latency is small. This does make the behavior
> a little bit harder to predict, e.g., the latency target is sometimes
> not respected, but this is necessary to get better sharing.

So, it's not like we can get better sharing for free; it always comes
at the cost of the (best effort) latency guarantee. Using thinktime for
idle detection doesn't mean that we get higher utilization for free.
If we get higher utilization by using thinktime instead of plain idle
detection, it means that we're sacrificing the latency guarantee more
with thinktime, so I don't think the argument that using thinktime
leads to higher utilization is a clear winner.

That is not to say that there's no benefit to thinktime. I can imagine
cases where it'd allow us to ride the line between acceptable latency
and good overall utilization better; however, it also comes with cases
where one has to wonder "what's going on? I have no idea what it's
doing". Given that blk-throttle is gonna ask for explicit and detailed
configuration from its users, I think it's vital that it has config
knobs which are immediately clear. Being tedious is already a burden
and I don't think adding unpredictability there is a good idea.

Thanks.
diff --git a/block/bio.c b/block/bio.c
index db85c57..7baa86d 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -30,6 +30,7 @@
 #include <linux/cgroup.h>

 #include <trace/events/block.h>
+#include "blk.h"

 /*
  * Test patch to inline a certain number of bi_io_vec's inside the bio
@@ -1759,6 +1760,7 @@ void bio_endio(struct bio *bio)
                goto again;
        }

+       blk_throtl_bio_endio(bio);
        if (bio->bi_end_io)
                bio->bi_end_io(bio);
 }
diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index 45a28c4..cb5fd85 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -21,6 +21,8 @@ static int throtl_quantum = 32;
 /* Throttling is performed over 100ms slice and after that slice is renewed */
 #define DFL_THROTL_SLICE (HZ / 10)
 #define MAX_THROTL_SLICE (HZ / 5)
+#define DFL_IDLE_THRESHOLD_SSD (50 * 1000) /* 50 us */
+#define DFL_IDLE_THRESHOLD_HD (1000 * 1000) /* 1 ms */

 static struct blkcg_policy blkcg_policy_throtl;

@@ -149,6 +151,10 @@ struct throtl_grp {
        /* When did we start a new slice */
        unsigned long slice_start[2];
        unsigned long slice_end[2];
+
+       u64 last_finish_time;
+       u64 checked_last_finish_time;
+       u64 avg_ttime;
 };

 struct throtl_data
@@ -172,6 +178,8 @@ struct throtl_data
        unsigned long high_downgrade_time;

        unsigned int scale;
+
+       u64 idle_ttime_threshold;
 };

 static void throtl_pending_timer_fn(unsigned long arg);
@@ -1629,6 +1637,14 @@ static unsigned long tg_last_high_overflow_time(struct throtl_grp *tg)
        return ret;
 }

+static bool throtl_tg_is_idle(struct throtl_grp *tg)
+{
+       /* cgroup is idle if average think time is more than threshold */
+       return ktime_get_ns() - tg->last_finish_time >
+               4 * tg->td->idle_ttime_threshold ||
+               tg->avg_ttime > tg->td->idle_ttime_threshold;
+}
+
 static bool throtl_upgrade_check_one(struct throtl_grp *tg)
 {
        struct throtl_service_queue *sq = &tg->service_queue;
@@ -1837,6 +1853,19 @@ static void throtl_downgrade_check(struct throtl_grp *tg)
        tg->last_io_disp[WRITE] = 0;
 }

+static void blk_throtl_update_ttime(struct throtl_grp *tg)
+{
+       u64 now = ktime_get_ns();
+       u64 last_finish_time = tg->last_finish_time;
+
+       if (now <= last_finish_time || last_finish_time == 0 ||
+           last_finish_time == tg->checked_last_finish_time)
+               return;
+
+       tg->avg_ttime = (tg->avg_ttime * 7 + now - last_finish_time) >> 3;
+       tg->checked_last_finish_time = last_finish_time;
+}
+
 bool blk_throtl_bio(struct request_queue *q, struct blkcg_gq *blkg,
                    struct bio *bio)
 {
@@ -1848,6 +1877,13 @@ bool blk_throtl_bio(struct request_queue *q, struct blkcg_gq *blkg,

        WARN_ON_ONCE(!rcu_read_lock_held());

+       if (tg->td->idle_ttime_threshold == -1) {
+               if (blk_queue_nonrot(q))
+                       tg->td->idle_ttime_threshold = DFL_IDLE_THRESHOLD_SSD;
+               else
+                       tg->td->idle_ttime_threshold = DFL_IDLE_THRESHOLD_HD;
+       }
+
        /* see throtl_charge_bio() */
        if ((bio->bi_opf & REQ_THROTTLED) || !tg->has_rules[rw])
                goto out;
@@ -1857,6 +1893,11 @@ bool blk_throtl_bio(struct request_queue *q, struct blkcg_gq *blkg,
        if (unlikely(blk_queue_bypass(q)))
                goto out_unlock;

+       bio_associate_current(bio);
+       bio->bi_cg_private = q;
+
+       blk_throtl_update_ttime(tg);
+
        sq = &tg->service_queue;

 again:
@@ -1917,7 +1958,6 @@ bool blk_throtl_bio(struct request_queue *q, struct blkcg_gq *blkg,

        tg->last_high_overflow_time[rw] = jiffies;

-       bio_associate_current(bio);
        tg->td->nr_queued[rw]++;
        throtl_add_bio_tg(bio, qn, tg);
        throttled = true;
@@ -1946,6 +1986,34 @@ bool blk_throtl_bio(struct request_queue *q, struct blkcg_gq *blkg,
        return throttled;
 }

+void blk_throtl_bio_endio(struct bio *bio)
+{
+       struct blkcg *blkcg;
+       struct blkcg_gq *blkg;
+       struct throtl_grp *tg;
+       struct request_queue *q;
+
+       q = bio->bi_cg_private;
+       if (!q)
+               return;
+       bio->bi_cg_private = NULL;
+
+       rcu_read_lock();
+       blkcg = bio_blkcg(bio);
+       if (!blkcg)
+               goto end;
+       blkg = blkg_lookup(blkcg, q);
+       if (!blkg)
+               goto end;
+
+       tg = blkg_to_tg(blkg ?: q->root_blkg);
+
+       tg->last_finish_time = ktime_get_ns();
+
+end:
+       rcu_read_unlock();
+}
+
 /*
  * Dispatch all bios from all children tg's queued on @parent_sq. On
  * return, @parent_sq is guaranteed to not have any active children tg's
@@ -2030,6 +2098,8 @@ int blk_throtl_init(struct request_queue *q)
        td->limit_index = LIMIT_MAX;
        td->high_upgrade_time = jiffies;
        td->high_downgrade_time = jiffies;
+
+       td->idle_ttime_threshold = -1;
        /* activate policy */
        ret = blkcg_activate_policy(q, &blkcg_policy_throtl);
        if (ret)
diff --git a/block/blk.h b/block/blk.h
index 39c14dd..b433f35 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -292,10 +292,12 @@ extern void blk_throtl_exit(struct request_queue *q);
 extern ssize_t blk_throtl_slice_show(struct request_queue *q, char *page);
 extern ssize_t blk_throtl_slice_store(struct request_queue *q,
        const char *page, size_t count);
+extern void blk_throtl_bio_endio(struct bio *bio);
 #else /* CONFIG_BLK_DEV_THROTTLING */
 static inline void blk_throtl_drain(struct request_queue *q) { }
 static inline int blk_throtl_init(struct request_queue *q) { return 0; }
 static inline void blk_throtl_exit(struct request_queue *q) { }
+static inline void blk_throtl_bio_endio(struct bio *bio) { }
 #endif /* CONFIG_BLK_DEV_THROTTLING */

 #endif /* BLK_INTERNAL_H */
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index cd395ec..ff8dd24 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -59,6 +59,7 @@ struct bio {
         */
        struct io_context       *bi_ioc;
        struct cgroup_subsys_state *bi_css;
+       void                    *bi_cg_private;
 #endif
        union {
 #if defined(CONFIG_BLK_DEV_INTEGRITY)
A cgroup gets assigned a high limit, but the cgroup could never
dispatch enough IO to cross the high limit. In such a case, the queue
state machine will remain in the LIMIT_HIGH state and all other cgroups
will be throttled according to the high limit. This is unfair to the
other cgroups. We should treat the cgroup as idle and upgrade the state
machine to a higher state.

We also have downgrade logic. If the state machine upgrades because a
cgroup is idle (really idle), the state machine will downgrade soon
because the cgroup is below its high limit. This isn't what we want. A
more complicated case is a cgroup that isn't idle while the queue is in
LIMIT_HIGH. But when the queue gets upgraded to a higher state, other
cgroups could dispatch more IO and this cgroup can't dispatch enough
IO, so the cgroup is below its high limit and looks idle (fake idle).
In this case, the queue should downgrade soon. The key to determining
whether we should downgrade is to detect if the cgroup is truly idle.

Unfortunately it's very hard to determine if a cgroup is really idle.
This patch uses the 'think time check' idea from CFQ for the purpose.
Please note, the idea doesn't work for all workloads. For example, a
workload with io depth 8 has disk utilization 100%, hence think time is
0, i.e., not idle. But the workload can run higher bandwidth with io
depth 16. Compared to io depth 16, the io depth 8 workload is idle. We
use the idea to roughly determine if a cgroup is idle.

We treat a cgroup as idle if its think time is above a threshold (by
default 50us for SSD and 1ms for HD). The idea is that a think time
above the threshold will start to harm performance. HD is much slower,
so a longer think time is acceptable.

Signed-off-by: Shaohua Li <shli@fb.com>
---
 block/bio.c               |  2 ++
 block/blk-throttle.c      | 72 ++++++++++++++++++++++++++++++++++++++++++++++-
 block/blk.h               |  2 ++
 include/linux/blk_types.h |  1 +
 4 files changed, 76 insertions(+), 1 deletion(-)
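As a worked example of the resulting idle test (standalone C, not kernel code): the numbers below are made up, the 50us constant is the patch's SSD default, and the two-part condition mirrors throtl_tg_is_idle() in the patch.

#include <stdio.h>
#include <stdbool.h>
#include <stdint.h>

#define IDLE_THRESHOLD_SSD 50000ULL     /* 50 us in ns, the patch's SSD default */

/*
 * Mirrors the two-part test in throtl_tg_is_idle(): either nothing has
 * completed for more than 4x the threshold, or the average think time
 * is above the threshold.
 */
static bool tg_is_idle(uint64_t now, uint64_t last_finish, uint64_t avg_ttime)
{
        return now - last_finish > 4 * IDLE_THRESHOLD_SSD ||
               avg_ttime > IDLE_THRESHOLD_SSD;
}

int main(void)
{
        /* io depth 8, disk 100% busy: completions back to back, think time ~0 */
        printf("busy workload idle?     %d\n",
               tg_is_idle(1000000, 990000, 0));
        /* sporadic cgroup: last completion 1 ms ago, average think time ~1 ms */
        printf("sporadic workload idle? %d\n",
               tg_is_idle(2000000, 1000000, 1000000));
        return 0;
}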