From patchwork Mon Oct 3 21:20:28 2016
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Shaohua Li
X-Patchwork-Id: 9360941
Return-Path:
Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org [172.30.200.125])
	by pdx-korg-patchwork.web.codeaurora.org (Postfix) with ESMTP id 8BD506075E
	for ; Mon, 3 Oct 2016 21:21:09 +0000 (UTC)
Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1])
	by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 8108B28842
	for ; Mon, 3 Oct 2016 21:21:09 +0000 (UTC)
Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486)
	id 75E4F28A74; Mon, 3 Oct 2016 21:21:09 +0000 (UTC)
X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on pdx-wl-mail.web.codeaurora.org
X-Spam-Level:
X-Spam-Status: No, score=-6.8 required=2.0 tests=BAYES_00,DKIM_SIGNED,
	RCVD_IN_DNSWL_HI,T_DKIM_INVALID autolearn=ham version=3.3.1
Received: from vger.kernel.org (vger.kernel.org [209.132.180.67])
	by mail.wl.linuxfoundation.org (Postfix) with ESMTP id E9D7528842
	for ; Mon, 3 Oct 2016 21:21:08 +0000 (UTC)
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1753510AbcJCVVG (ORCPT ); Mon, 3 Oct 2016 17:21:06 -0400
Received: from mx0a-00082601.pphosted.com ([67.231.145.42]:33665 "EHLO mx0a-00082601.pphosted.com" rhost-flags-OK-OK-OK-OK)
	by vger.kernel.org with ESMTP id S1753232AbcJCVVD (ORCPT );
	Mon, 3 Oct 2016 17:21:03 -0400
Received: from pps.filterd (m0044010.ppops.net [127.0.0.1])
	by mx0a-00082601.pphosted.com (8.16.0.17/8.16.0.17) with SMTP id u93LJOJ7031421
	for ; Mon, 3 Oct 2016 14:21:02 -0700
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=fb.com;
	h=from : to : cc : subject : date : message-id : in-reply-to : references : mime-version : content-type;
	s=facebook; bh=q7C56jU0hpirW4Rlj5qxkXUppRQPRsNmNA8rVrNFJkI=;
	b=BQRoPOXgiF9QCVaVMfLW1Xey8jY0adelfh3k+rH3ytQzqaQm3QrygnhezAWsruoNBbao
	zFST2kmIGhaoCGzVY6Bp/k2jz8tje6pWSYAW6fyP3jxZJtCIjrO3zkE7n/h/UvlR6PUG
	5vYi/TcE/KCnUuMZEgXF/cZjPqFJLj9bKdc=
Received: from mail.thefacebook.com ([199.201.64.23])
	by mx0a-00082601.pphosted.com with ESMTP id 25uwpd8s6u-12
	(version=TLSv1 cipher=ECDHE-RSA-AES256-SHA bits=256 verify=NOT)
	for ; Mon, 03 Oct 2016 14:21:02 -0700
Received: from mx-out.facebook.com (192.168.52.123)
	by PRN-CHUB07.TheFacebook.com (192.168.16.17) with Microsoft SMTP Server (TLS)
	id 14.3.294.0; Mon, 3 Oct 2016 14:20:34 -0700
Received: from facebook.com (2401:db00:21:603d:face:0:19:0)
	by mx-out.facebook.com (10.223.100.97) with ESMTP id 380e715e89af11e6a23724be0593f280-71df7a50
	for ; Mon, 03 Oct 2016 14:20:33 -0700
Received: by devbig638.prn2.facebook.com (Postfix, from userid 11222)
	id 90D9742C11AD; Mon, 3 Oct 2016 14:20:31 -0700 (PDT)
From: Shaohua Li
To: ,
CC: , , , ,
Subject: [PATCH v3 09/11] block-throttle: make bandwidth change smooth
Date: Mon, 3 Oct 2016 14:20:28 -0700
Message-ID:
X-Mailer: git-send-email 2.9.3
In-Reply-To:
References:
X-FB-Internal: Safe
MIME-Version: 1.0
X-Proofpoint-Spam-Reason: safe
X-FB-Internal: Safe
X-Proofpoint-Virus-Version: vendor=fsecure engine=2.50.10432:, , definitions=2016-10-03_12:, , signatures=0
Sender: linux-block-owner@vger.kernel.org
Precedence: bulk
List-ID:
X-Mailing-List: linux-block@vger.kernel.org
X-Virus-Scanned: ClamAV using ClamSMTP

When all cgroups reach their high limit, cgroups can dispatch more IO.
This could let some cgroups dispatch more IO while others cannot, and
some cgroups could even dispatch less IO than their high limit. For
example, cg1 has a high limit of 10MB/s, cg2 has a high limit of
80MB/s, and the disk's maximum bandwidth is 120MB/s for the workload.
Their bps could be something like this:

cg1/cg2 bps: T1: 10/80 -> T2: 60/60 -> T3: 10/80

At T1, all cgroups reach their high limit, so they can dispatch more IO
later. Then cg1 dispatches more IO and cg2 has no room to dispatch
enough IO. At T2, cg2 only dispatches 60MB/s.
Since we detect that cg2 dispatches less IO than its high limit of
80MB/s, we downgrade the queue from LIMIT_MAX to LIMIT_HIGH, and then
all cgroups are throttled to their high limit (T3). cg2 will have
bandwidth below its high limit most of the time.

The big problem here is that we don't know the maximum bandwidth of the
workload, so we can't make a smart decision to avoid the situation.

This patch makes cgroup bandwidth changes smooth. After the disk
upgrades from LIMIT_HIGH to LIMIT_MAX, we don't allow cgroups to use all
bandwidth up to their max limit immediately. Their bandwidth limit is
increased gradually to avoid the above situation. So the above example
becomes something like:

cg1/cg2 bps: 10/80 -> 15/105 -> 20/100 -> 25/95 -> 30/90 -> 35/85 -> 40/80 -> 45/75 -> 10/80

In this way cgroup bandwidth will be above the high limit most of the
time. This still doesn't fully utilize disk bandwidth, but that's the
price we pay for sharing.

Note this doesn't completely prevent a cgroup from running under its
high limit. The best way to guarantee that a cgroup doesn't run under
its limit is to set a max limit. For example, if we set cg1's max limit
to 40, cg2 will never run under its high limit.
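
As a rough illustration of the ramp described above, the following
standalone C sketch reproduces the example sequence. It is not the
kernel code: ramped_limit(), simulate_step() and NO_LIMIT are
hypothetical names, and the model assumes a 120MB/s disk where cg1 is
served first and cg2 gets whatever is left.

```c
#include <assert.h>
#include <stdint.h>

/* Sentinel for "no limit configured", mirroring the patch's use of -1. */
#define NO_LIMIT UINT64_MAX

/*
 * Effective limit while the queue is in LIMIT_MAX state: start from the
 * high limit and add half of it per elapsed slice, capped at the max
 * limit. Overflow for very large limits is ignored in this sketch.
 */
static uint64_t ramped_limit(uint64_t high, uint64_t max, unsigned int scale)
{
	uint64_t limit;

	if (high == NO_LIMIT)
		return max;	/* no high limit: nothing to ramp from */
	limit = high + (high >> 1) * scale;
	return limit < max ? limit : max;
}

/*
 * Assumed two-cgroup model (illustration only): cg1 has high limit 10,
 * cg2 has high limit 80, neither has a max limit. cg1 is served first
 * up to its ramped limit; cg2 gets the rest of the disk bandwidth, up
 * to its own ramped limit.
 */
static void simulate_step(uint64_t disk, unsigned int scale,
			  uint64_t *cg1, uint64_t *cg2)
{
	uint64_t l1 = ramped_limit(10, NO_LIMIT, scale);
	uint64_t l2 = ramped_limit(80, NO_LIMIT, scale);
	uint64_t left;

	*cg1 = l1 < disk ? l1 : disk;
	left = disk - *cg1;
	*cg2 = l2 < left ? l2 : left;
}
```

Stepping scale from 0 to 7 with disk set to 120 gives 10/80, 15/105,
20/100 and so on down to 45/75, at which point cg2 is below its high
limit and a downgrade would follow. With a max limit of 40 on cg1,
ramped_limit(10, 40, 7) stays at 40, which leaves 80MB/s of the assumed
120MB/s for cg2.
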

Signed-off-by: Shaohua Li
---
 block/blk-throttle.c | 44 ++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 40 insertions(+), 4 deletions(-)

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index 759bce1..1658b13 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -170,6 +170,8 @@ struct throtl_data
 
 	unsigned long high_upgrade_time;
 	unsigned long high_downgrade_time;
+
+	unsigned int scale;
 };
 
 static void throtl_pending_timer_fn(unsigned long arg);
@@ -224,12 +226,28 @@ static struct throtl_data *sq_to_td(struct throtl_service_queue *sq)
 static uint64_t tg_bps_limit(struct throtl_grp *tg, int rw)
 {
 	struct blkcg_gq *blkg = tg_to_blkg(tg);
+	struct throtl_data *td;
 	uint64_t ret;
 
 	if (cgroup_subsys_on_dfl(io_cgrp_subsys) && !blkg->parent)
 		return -1;
-	ret = tg->bps[rw][tg->td->limit_index];
-	if (ret == -1 && tg->td->limit_index == LIMIT_HIGH)
+	td = tg->td;
+	ret = tg->bps[rw][td->limit_index];
+	if (td->limit_index == LIMIT_MAX && tg->bps[rw][LIMIT_HIGH] != -1) {
+		uint64_t increase;
+
+		if (td->scale < 4096 && time_after_eq(jiffies,
+		    td->high_upgrade_time + td->scale * td->throtl_slice)) {
+			uint64_t time = jiffies - td->high_upgrade_time;
+
+			do_div(time, td->throtl_slice);
+			td->scale = time;
+		}
+		increase = (tg->bps[rw][LIMIT_HIGH] >> 1) * td->scale;
+		ret = min(tg->bps[rw][LIMIT_MAX],
+			tg->bps[rw][LIMIT_HIGH] + increase);
+	}
+	if (ret == -1 && td->limit_index == LIMIT_HIGH)
 		return tg->bps[rw][LIMIT_MAX];
 
 	return ret;
@@ -238,12 +256,29 @@ static uint64_t tg_bps_limit(struct throtl_grp *tg, int rw)
 static unsigned int tg_iops_limit(struct throtl_grp *tg, int rw)
 {
 	struct blkcg_gq *blkg = tg_to_blkg(tg);
+	struct throtl_data *td;
 	unsigned int ret;
 
 	if (cgroup_subsys_on_dfl(io_cgrp_subsys) && !blkg->parent)
 		return -1;
-	ret = tg->iops[rw][tg->td->limit_index];
-	if (ret == -1 && tg->td->limit_index == LIMIT_HIGH)
+	td = tg->td;
+	ret = tg->iops[rw][td->limit_index];
+	if (td->limit_index == LIMIT_MAX && tg->iops[rw][LIMIT_HIGH] != -1) {
+		uint64_t increase;
+
+		if (td->scale < 4096 && time_after_eq(jiffies,
+		    td->high_upgrade_time + td->scale * td->throtl_slice)) {
+			uint64_t time = jiffies - td->high_upgrade_time;
+
+			do_div(time, td->throtl_slice);
+			td->scale = time;
+		}
+
+		increase = (tg->iops[rw][LIMIT_HIGH] >> 1) * td->scale;
+		ret = min(tg->iops[rw][LIMIT_MAX],
+			tg->iops[rw][LIMIT_HIGH] + (unsigned int)increase);
+	}
+	if (ret == -1 && td->limit_index == LIMIT_HIGH)
 		return tg->iops[rw][LIMIT_MAX];
 	return ret;
 }
@@ -1671,6 +1706,7 @@ static void throtl_upgrade_state(struct throtl_data *td)
 
 	td->limit_index = LIMIT_MAX;
 	td->high_upgrade_time = jiffies;
+	td->scale = 0;
 	rcu_read_lock();
 	blkg_for_each_descendant_post(blkg, pos_css, td->queue->root_blkg) {
 		struct throtl_grp *tg = blkg_to_tg(blkg);
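
The scale bookkeeping in tg_bps_limit() and tg_iops_limit() can be
tried outside the kernel with a small userspace sketch. This is only an
approximation under stated assumptions: update_scale(),
ticks_after_eq() and SCALE_CAP are hypothetical names, plain unsigned
long tick counts stand in for jiffies, and ordinary division stands in
for do_div().

```c
#include <assert.h>

/* Same threshold the patch compares td->scale against. */
#define SCALE_CAP 4096u

/*
 * Wrap-safe "a is at or after b" test on a free-running tick counter,
 * the same idea as the kernel's time_after_eq().
 */
static int ticks_after_eq(unsigned long a, unsigned long b)
{
	return (long)(a - b) >= 0;
}

/*
 * Recompute scale as the number of whole slices elapsed since the
 * upgrade, but only once the current slice boundary has been reached
 * and scale is still below the cap. Otherwise keep the old value.
 */
static unsigned int update_scale(unsigned int scale, unsigned long now,
				 unsigned long upgrade_time,
				 unsigned long slice)
{
	if (scale < SCALE_CAP &&
	    ticks_after_eq(now, upgrade_time + scale * slice))
		scale = (unsigned int)((now - upgrade_time) / slice);
	return scale;
}
```

For instance, with an upgrade at tick 1000 and a slice of 100 ticks,
update_scale(0, 1250, 1000, 100) gives 2, so the limit would have grown
by one full high limit (two halves) at that point. Because the
subtraction is done on unsigned values, the result stays correct when
the tick counter wraps between the upgrade and the lookup.
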