From patchwork Sat Dec 30 20:32:04 2017
X-Patchwork-Submitter: Timofey Titovets
X-Patchwork-Id: 10137909
From: Timofey Titovets
To: linux-btrfs@vger.kernel.org
Cc: Timofey Titovets
Subject: [PATCH V3] Btrfs: enhance raid1/10 balance heuristic
Date: Sat, 30 Dec 2017 23:32:04 +0300
Message-Id: <20171230203204.13151-1-nefelim4ag@gmail.com>
X-Mailer: git-send-email 2.15.1

Currently the btrfs raid1/10 balancer distributes read requests across
mirrors based on pid % number of mirrors.

Make the logic aware of:
 - whether one of the underlying devices is non-rotational
 - the queue length of the underlying devices

By default keep the pid % num_mirrors guess, but:
 - if one of the mirrors is non-rotational, repick it as optimal
 - if an underlying mirror has a shorter queue length than the optimal
   one, repick that mirror

To avoid round-robin request balancing, round the queue length down:
 - by 8 for rotational devices
 - by 2 if all devices are non-rotational

Changes:
  v1 -> v2:
    - Use the part_in_flight() helper from genhd.c to get the queue length
    - Move the guessing code into guess_optimal()
    - Change the balancer logic: use pid % num_mirrors by default and
      rebalance onto spinning rust only if one of the underlying devices
      is overloaded
  v2 -> v3:
    - Fix the argument for RAID10 - use sub_stripes instead of num_stripes

Signed-off-by: Timofey Titovets
Reviewed-by: Dmitrii Tcvetkov
Tested-by: Dmitrii Tcvetkov
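As an aside for readers, the selection rule described above can be modelled
in a few lines of user-space C. The sketch below is illustrative only and is
not part of the patch: pick_mirror(), round_down_to(), QUEUE_ROUND_HDD and
QUEUE_ROUND_SSD are invented names standing in for the in-kernel
guess_optimal(), ALIGN_DOWN() and the 8/2 rounding factors.

    /*
     * Illustrative user-space model of the mirror selection heuristic.
     * Not part of the patch; all names here are invented for the example.
     */
    #include <stdio.h>

    #define QUEUE_ROUND_HDD 8   /* rounding factor if any mirror is rotational */
    #define QUEUE_ROUND_SSD 2   /* rounding factor if all mirrors are non-rotational */

    /* Round down to a multiple of "round", like the kernel's ALIGN_DOWN(). */
    static int round_down_to(int val, int round)
    {
            return val - (val % round);
    }

    /*
     * Pick a mirror: start from the pid-based guess, prefer a non-rotational
     * mirror in a mixed array, then repick the mirror with the shortest
     * rounded-down queue.
     */
    static int pick_mirror(int num, const int queue_len[], const int nonrot[],
                           int pid_guess)
    {
            int optimal = pid_guess;
            int all_nonrot = 1, all_rot = 1;
            int round, i;

            for (i = 0; i < num; i++) {
                    if (nonrot[i])
                            all_rot = 0;
                    else
                            all_nonrot = 0;
            }

            /* Mixed array: a non-rotational mirror is the better starting point. */
            if (!all_nonrot && !all_rot)
                    for (i = 0; i < num; i++)
                            if (nonrot[i])
                                    optimal = i;

            round = all_nonrot ? QUEUE_ROUND_SSD : QUEUE_ROUND_HDD;

            /* Repick only if another mirror is clearly less loaded. */
            for (i = 0; i < num; i++)
                    if (round_down_to(queue_len[i], round) <
                        round_down_to(queue_len[optimal], round))
                            optimal = i;

            return optimal;
    }

    int main(void)
    {
            int queue_len[2] = { 14, 3 };   /* in-flight requests per mirror */
            int nonrot[2] = { 0, 0 };       /* both mirrors are rotational */

            /* 14 rounds down to 8, 3 rounds down to 0 -> mirror 1 is picked. */
            printf("picked mirror %d\n", pick_mirror(2, queue_len, nonrot, 0));
            return 0;
    }

The diff below implements the same three steps in guess_optimal(): start from
the pid-based guess, prefer a non-rotational mirror in a mixed array, then
move to the mirror with the shortest rounded-down queue.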
---
 block/genhd.c      |   1 +
 fs/btrfs/volumes.c | 115 ++++++++++++++++++++++++++++++++++++++++++++++++++++-
 2 files changed, 114 insertions(+), 2 deletions(-)

diff --git a/block/genhd.c b/block/genhd.c
index 96a66f671720..a7742bbbb6a7 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -81,6 +81,7 @@ void part_in_flight(struct request_queue *q, struct hd_struct *part,
 				atomic_read(&part->in_flight[1]);
 	}
 }
+EXPORT_SYMBOL_GPL(part_in_flight);
 
 struct hd_struct *__disk_get_part(struct gendisk *disk, int partno)
 {
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 49810b70afd3..a3b80ba31d4d 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -27,6 +27,7 @@
 #include <...>
 #include <...>
 #include <...>
+#include <...>
 #include <...>
 #include "ctree.h"
 #include "extent_map.h"
@@ -5153,6 +5154,111 @@ int btrfs_is_parity_mirror(struct btrfs_fs_info *fs_info, u64 logical, u64 len)
 	return ret;
 }
 
+/**
+ * bdev_get_queue_len - return the rounded-down in-flight queue length of a bdev
+ *
+ * @bdev: target bdev
+ * @round_down: rounding factor, large for HDDs and small for SSDs, e.g. 8 and 2
+ */
+static int bdev_get_queue_len(struct block_device *bdev, int round_down)
+{
+	int sum;
+	struct hd_struct *bd_part = bdev->bd_part;
+	struct request_queue *rq = bdev_get_queue(bdev);
+	uint32_t inflight[2] = {0, 0};
+
+	part_in_flight(rq, bd_part, inflight);
+
+	sum = max_t(uint32_t, inflight[0], inflight[1]);
+
+	/*
+	 * Try to prevent switching the mirror on every sneeze
+	 * by rounding the result down by some value.
+	 */
+	return ALIGN_DOWN(sum, round_down);
+}
+
+/**
+ * guess_optimal - return the guessed optimal mirror
+ *
+ * The optimal mirror is expected to be pid % num_stripes.
+ *
+ * That is generally OK for spreading the load; add some balancing
+ * based on the device queue length.
+ *
+ * Basic ideas:
+ *  - Sequential reads generate a low number of requests, so if the
+ *    drive loads are equal, use pid % num_stripes balancing
+ *  - For mixed rotational/non-rotational mirrors, pick the
+ *    non-rotational one as optimal and repick if another device has a
+ *    significantly shorter queue
+ *  - Repick the optimal mirror if the queue length of another mirror
+ *    is shorter
+ */
+static int guess_optimal(struct map_lookup *map, int num, int optimal)
+{
+	int i;
+	int round_down = 8;
+	int qlen[num];
+	bool is_nonrot[num];
+	bool all_bdev_nonrot = true;
+	bool all_bdev_rotate = true;
+	struct block_device *bdev;
+
+	if (num == 1)
+		return optimal;
+
+	/* Check accessible bdevs */
+	for (i = 0; i < num; i++) {
+		/* Init for missing bdevs */
+		is_nonrot[i] = false;
+		qlen[i] = INT_MAX;
+		bdev = map->stripes[i].dev->bdev;
+		if (bdev) {
+			qlen[i] = 0;
+			is_nonrot[i] = blk_queue_nonrot(bdev_get_queue(bdev));
+			if (is_nonrot[i])
+				all_bdev_rotate = false;
+			else
+				all_bdev_nonrot = false;
+		}
+	}
+
+	/*
+	 * Don't bother with the computation
+	 * if only one of two bdevs is accessible
+	 */
+	if (num == 2 && qlen[0] != qlen[1]) {
+		if (qlen[0] < qlen[1])
+			return 0;
+		else
+			return 1;
+	}
+
+	if (all_bdev_nonrot)
+		round_down = 2;
+
+	for (i = 0; i < num; i++) {
+		if (qlen[i])
+			continue;
+		bdev = map->stripes[i].dev->bdev;
+		qlen[i] = bdev_get_queue_len(bdev, round_down);
+	}
+
+	/* For the mixed case, pick a non-rotational dev as optimal */
+	if (all_bdev_rotate == all_bdev_nonrot) {
+		for (i = 0; i < num; i++) {
+			if (is_nonrot[i])
+				optimal = i;
+		}
+	}
+
+	for (i = 0; i < num; i++) {
+		if (qlen[optimal] > qlen[i])
+			optimal = i;
+	}
+
+	return optimal;
+}
+
 static int find_live_mirror(struct btrfs_fs_info *fs_info,
 			    struct map_lookup *map, int first, int num,
 			    int optimal, int dev_replace_is_ongoing)
@@ -5601,6 +5707,7 @@ static int __btrfs_map_block(struct btrfs_fs_info *fs_info,
 	int i;
 	int ret = 0;
 	int num_stripes;
+	int optimal;
 	int max_errors = 0;
 	int tgtdev_indexes = 0;
 	struct btrfs_bio *bbio = NULL;
@@ -5713,9 +5820,11 @@ static int __btrfs_map_block(struct btrfs_fs_info *fs_info,
 		else if (mirror_num)
 			stripe_index = mirror_num - 1;
 		else {
+			optimal = guess_optimal(map, map->num_stripes,
+					current->pid % map->num_stripes);
 			stripe_index = find_live_mirror(fs_info, map, 0,
 					    map->num_stripes,
-					    current->pid % map->num_stripes,
+					    optimal,
 					    dev_replace_is_ongoing);
 			mirror_num = stripe_index + 1;
 		}
@@ -5741,10 +5850,12 @@ static int __btrfs_map_block(struct btrfs_fs_info *fs_info,
 			stripe_index += mirror_num - 1;
 		else {
 			int old_stripe_index = stripe_index;
+			optimal = guess_optimal(map, map->sub_stripes,
+					current->pid % map->sub_stripes);
 			stripe_index = find_live_mirror(fs_info, map,
 					      stripe_index,
 					      map->sub_stripes, stripe_index +
-					      current->pid % map->sub_stripes,
+					      optimal,
 					      dev_replace_is_ongoing);
 			mirror_num = stripe_index - old_stripe_index + 1;
 		}
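
As a quick worked illustration of why the queue length is rounded down before
comparison: the small user-space example below mimics the effect with its own
ALIGN_DOWN macro (it is not the kernel definition, just an approximation that
matches for these values).

    #include <stdio.h>

    /* Same result as the kernel's ALIGN_DOWN() for small positive values. */
    #define ALIGN_DOWN(x, a) ((x) - ((x) % (a)))

    int main(void)
    {
            /* With the rotational factor of 8, 9 and 14 in-flight requests
             * fall into the same bucket, so the pid-based guess is kept... */
            printf("%d %d\n", ALIGN_DOWN(9, 8), ALIGN_DOWN(14, 8));    /* 8 8 */
            /* ...while 9 vs 17 round to 8 vs 16, so the less loaded mirror wins. */
            printf("%d %d\n", ALIGN_DOWN(9, 8), ALIGN_DOWN(17, 8));    /* 8 16 */
            return 0;
    }

Small differences in load collapse into the same bucket, so the cheap
pid-based choice is kept and reads do not ping-pong between mirrors.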