From patchwork Sat Dec 30 20:32:04 2017
X-Patchwork-Submitter: Timofey Titovets
X-Patchwork-Id: 10137909
From: Timofey Titovets
To: linux-btrfs@vger.kernel.org
Cc: Timofey Titovets
Subject: [PATCH V3] Btrfs: enhance raid1/10 balance heuristic
Date: Sat, 30 Dec 2017 23:32:04 +0300
Message-Id: <20171230203204.13151-1-nefelim4ag@gmail.com>
X-Mailer: git-send-email 2.15.1

Currently the btrfs raid1/10 balancer distributes read requests across
mirrors based on pid % number of mirrors.

Make the logic aware of:
 - whether one of the underlying devices is non-rotational
 - the queue length of the underlying devices

By default keep the pid % num_mirrors guess, but:
 - if one of the mirrors is non-rotational, repick it as optimal
 - if an underlying mirror has a shorter queue length than the optimal
   one, repick that mirror

To avoid round-robin request balancing, round the queue length down:
 - by 8 for rotational devices
 - by 2 if all devices are non-rotational

Changes:
  v1 -> v2:
    - Use the part_in_flight() helper from genhd.c to get the queue length
    - Move the guessing code into guess_optimal()
    - Change the balancer logic: use pid % num_mirrors by default and
      rebalance onto spinning rust only if one of the underlying devices
      is overloaded
  v2 -> v3:
    - Fix the argument for RAID10 - use sub_stripes instead of num_stripes

Signed-off-by: Timofey Titovets
Reviewed-by: Dmitrii Tcvetkov
Tested-by: Dmitrii Tcvetkov
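As an aside for readers, the selection rule described above can be modelled
in a few lines of user-space C. The sketch below is illustrative only and is
not part of the patch: pick_mirror(), round_down_to(), QUEUE_ROUND_HDD and
QUEUE_ROUND_SSD are invented names standing in for the in-kernel
guess_optimal(), ALIGN_DOWN() and the 8/2 rounding factors.

    /*
     * Illustrative user-space model of the mirror selection heuristic.
     * Not part of the patch; all names here are invented for the example.
     */
    #include <stdio.h>

    #define QUEUE_ROUND_HDD 8   /* rounding factor if any mirror is rotational */
    #define QUEUE_ROUND_SSD 2   /* rounding factor if all mirrors are non-rotational */

    /* Round down to a multiple of "round", like the kernel's ALIGN_DOWN(). */
    static int round_down_to(int val, int round)
    {
            return val - (val % round);
    }

    /*
     * Pick a mirror: start from the pid-based guess, prefer a non-rotational
     * mirror in a mixed array, then repick the mirror with the shortest
     * rounded-down queue.
     */
    static int pick_mirror(int num, const int queue_len[], const int nonrot[],
                           int pid_guess)
    {
            int optimal = pid_guess;
            int all_nonrot = 1, all_rot = 1;
            int round, i;

            for (i = 0; i < num; i++) {
                    if (nonrot[i])
                            all_rot = 0;
                    else
                            all_nonrot = 0;
            }

            /* Mixed array: a non-rotational mirror is the better starting point. */
            if (!all_nonrot && !all_rot)
                    for (i = 0; i < num; i++)
                            if (nonrot[i])
                                    optimal = i;

            round = all_nonrot ? QUEUE_ROUND_SSD : QUEUE_ROUND_HDD;

            /* Repick only if another mirror is clearly less loaded. */
            for (i = 0; i < num; i++)
                    if (round_down_to(queue_len[i], round) <
                        round_down_to(queue_len[optimal], round))
                            optimal = i;

            return optimal;
    }

    int main(void)
    {
            int queue_len[2] = { 14, 3 };   /* in-flight requests per mirror */
            int nonrot[2] = { 0, 0 };       /* both mirrors are rotational */

            /* 14 rounds down to 8, 3 rounds down to 0 -> mirror 1 is picked. */
            printf("picked mirror %d\n", pick_mirror(2, queue_len, nonrot, 0));
            return 0;
    }

The diff below implements the same three steps in guess_optimal(): start from
the pid-based guess, prefer a non-rotational mirror in a mixed array, then
move to the mirror with the shortest rounded-down queue.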
---
 block/genhd.c      |   1 +
 fs/btrfs/volumes.c | 115 ++++++++++++++++++++++++++++++++++++++++++++++++++++-
 2 files changed, 114 insertions(+), 2 deletions(-)

diff --git a/block/genhd.c b/block/genhd.c
index 96a66f671720..a7742bbbb6a7 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -81,6 +81,7 @@ void part_in_flight(struct request_queue *q, struct hd_struct *part,
 				atomic_read(&part->in_flight[1]);
 	}
 }
+EXPORT_SYMBOL_GPL(part_in_flight);
 
 struct hd_struct *__disk_get_part(struct gendisk *disk, int partno)
 {
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 49810b70afd3..a3b80ba31d4d 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -27,6 +27,7 @@
 #include <...>
 #include <...>
 #include <...>
+#include <...>
 #include <...>
 #include "ctree.h"
 #include "extent_map.h"
@@ -5153,6 +5154,111 @@ int btrfs_is_parity_mirror(struct btrfs_fs_info *fs_info, u64 logical, u64 len)
 	return ret;
 }
 
+/**
+ * bdev_get_queue_len - return the rounded-down in-flight queue length of a bdev
+ *
+ * @bdev: target bdev
+ * @round_down: rounding factor, large for HDDs and small for SSDs, e.g. 8 and 2
+ */
+static int bdev_get_queue_len(struct block_device *bdev, int round_down)
+{
+	int sum;
+	struct hd_struct *bd_part = bdev->bd_part;
+	struct request_queue *rq = bdev_get_queue(bdev);
+	uint32_t inflight[2] = {0, 0};
+
+	part_in_flight(rq, bd_part, inflight);
+
+	sum = max_t(uint32_t, inflight[0], inflight[1]);
+
+	/*
+	 * Try to prevent switching the mirror on every sneeze
+	 * by rounding the result down by some value.
+	 */
+	return ALIGN_DOWN(sum, round_down);
+}
+
+/**
+ * guess_optimal - return the guessed optimal mirror
+ *
+ * The optimal mirror is expected to be pid % num_stripes.
+ *
+ * That is generally OK for spreading the load; add some balancing
+ * based on the device queue length.
+ *
+ * Basic ideas:
+ *  - Sequential reads generate a low number of requests, so if the
+ *    drive loads are equal, use pid % num_stripes balancing
+ *  - For mixed rotational/non-rotational mirrors, pick the
+ *    non-rotational one as optimal and repick if another device has a
+ *    significantly shorter queue
+ *  - Repick the optimal mirror if the queue length of another mirror
+ *    is shorter
+ */
+static int guess_optimal(struct map_lookup *map, int num, int optimal)
+{
+	int i;
+	int round_down = 8;
+	int qlen[num];
+	bool is_nonrot[num];
+	bool all_bdev_nonrot = true;
+	bool all_bdev_rotate = true;
+	struct block_device *bdev;
+
+	if (num == 1)
+		return optimal;
+
+	/* Check accessible bdevs */
+	for (i = 0; i < num; i++) {
+		/* Init for missing bdevs */
+		is_nonrot[i] = false;
+		qlen[i] = INT_MAX;
+		bdev = map->stripes[i].dev->bdev;
+		if (bdev) {
+			qlen[i] = 0;
+			is_nonrot[i] = blk_queue_nonrot(bdev_get_queue(bdev));
+			if (is_nonrot[i])
+				all_bdev_rotate = false;
+			else
+				all_bdev_nonrot = false;
+		}
+	}
+
+	/*
+	 * Don't bother with the computation
+	 * if only one of two bdevs is accessible
+	 */
+	if (num == 2 && qlen[0] != qlen[1]) {
+		if (qlen[0] < qlen[1])
+			return 0;
+		else
+			return 1;
+	}
+
+	if (all_bdev_nonrot)
+		round_down = 2;
+
+	for (i = 0; i < num; i++) {
+		if (qlen[i])
+			continue;
+		bdev = map->stripes[i].dev->bdev;
+		qlen[i] = bdev_get_queue_len(bdev, round_down);
+	}
+
+	/* For the mixed case, pick a non-rotational dev as optimal */
+	if (all_bdev_rotate == all_bdev_nonrot) {
+		for (i = 0; i < num; i++) {
+			if (is_nonrot[i])
+				optimal = i;
+		}
+	}
+
+	for (i = 0; i < num; i++) {
+		if (qlen[optimal] > qlen[i])
+			optimal = i;
+	}
+
+	return optimal;
+}
+
 static int find_live_mirror(struct btrfs_fs_info *fs_info,
 			    struct map_lookup *map, int first, int num,
 			    int optimal, int dev_replace_is_ongoing)
@@ -5601,6 +5707,7 @@ static int __btrfs_map_block(struct btrfs_fs_info *fs_info,
 	int i;
 	int ret = 0;
 	int num_stripes;
+	int optimal;
 	int max_errors = 0;
 	int tgtdev_indexes = 0;
 	struct btrfs_bio *bbio = NULL;
@@ -5713,9 +5820,11 @@ static int __btrfs_map_block(struct btrfs_fs_info *fs_info,
 		else if (mirror_num)
 			stripe_index = mirror_num - 1;
 		else {
+			optimal = guess_optimal(map, map->num_stripes,
+					current->pid % map->num_stripes);
 			stripe_index = find_live_mirror(fs_info, map, 0,
 					    map->num_stripes,
-					    current->pid % map->num_stripes,
+					    optimal,
 					    dev_replace_is_ongoing);
 			mirror_num = stripe_index + 1;
 		}
@@ -5741,10 +5850,12 @@ static int __btrfs_map_block(struct btrfs_fs_info *fs_info,
 			stripe_index += mirror_num - 1;
 		else {
 			int old_stripe_index = stripe_index;
+			optimal = guess_optimal(map, map->sub_stripes,
+					current->pid % map->sub_stripes);
 			stripe_index = find_live_mirror(fs_info, map,
 					      stripe_index,
 					      map->sub_stripes, stripe_index +
-					      current->pid % map->sub_stripes,
+					      optimal,
 					      dev_replace_is_ongoing);
 			mirror_num = stripe_index - old_stripe_index + 1;
 		}
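
As a quick worked illustration of why the queue length is rounded down before
comparison: the small user-space example below mimics the effect with its own
ALIGN_DOWN macro (it is not the kernel definition, just an approximation that
matches for these values).

    #include <stdio.h>

    /* Same result as the kernel's ALIGN_DOWN() for small positive values. */
    #define ALIGN_DOWN(x, a) ((x) - ((x) % (a)))

    int main(void)
    {
            /* With the rotational factor of 8, 9 and 14 in-flight requests
             * fall into the same bucket, so the pid-based guess is kept... */
            printf("%d %d\n", ALIGN_DOWN(9, 8), ALIGN_DOWN(14, 8));    /* 8 8 */
            /* ...while 9 vs 17 round to 8 vs 16, so the less loaded mirror wins. */
            printf("%d %d\n", ALIGN_DOWN(9, 8), ALIGN_DOWN(17, 8));    /* 8 16 */
            return 0;
    }

Small differences in load collapse into the same bucket, so the cheap
pid-based choice is kept and reads do not ping-pong between mirrors.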