[V9] Btrfs: enhance raid1/10 balance heuristic
diff mbox series

Message ID 20190506143740.26614-1-timofey.titovets@synesis.ru
State New
Headers show
Series
  • [V9] Btrfs: enhance raid1/10 balance heuristic
Related show

Commit Message

Timofey Titovets May 6, 2019, 2:37 p.m. UTC
From: Timofey Titovets <nefelim4ag@gmail.com>

Currently btrfs raid1/10 bаlance requests to mirrors,
based on pid % num of mirrors.

Add logic to consider:
 - If one of devices are non rotational
 - Queue length to devices

By default pid % num_mirrors guessing will be used, but:
 - If one of mirrors are non rotational,
   reroute requests to non rotational
 - If other mirror have less queue length then optimal,
   repick to that mirror

For avoid round-robin request balancing,
lets use abs diff of queue length,
and only if diff greater then threshold try rebalance:
 - threshold 8 for rotational devs
 - threshold 2 for all non rotational devs

Some bench results from mail list
(Dmitrii Tcvetkov <demfloro@demfloro.ru>):
Benchmark summary (arithmetic mean of 3 runs):
         Mainline     Patch
------------------------------------
RAID1  | 18.9 MiB/s | 26.5 MiB/s
RAID10 | 30.7 MiB/s | 30.7 MiB/s
------------------------------------------------------------------------
mainline, fio got lucky to read from first HDD (quite slow HDD):
Jobs: 1 (f=1): [r(1)][100.0%][r=8456KiB/s,w=0KiB/s][r=264,w=0 IOPS]
  read: IOPS=265, BW=8508KiB/s (8712kB/s)(499MiB/60070msec)
  lat (msec): min=2, max=825, avg=60.17, stdev=65.06
------------------------------------------------------------------------
mainline, fio got lucky to read from second HDD (much more modern):
Jobs: 1 (f=1): [r(1)][8.7%][r=11.9MiB/s,w=0KiB/s][r=380,w=0 IOPS]
  read: IOPS=378, BW=11.8MiB/s (12.4MB/s)(710MiB/60051msec)
  lat (usec): min=416, max=644286, avg=42312.74, stdev=48518.56
------------------------------------------------------------------------
mainline, fio got lucky to read from an SSD:
Jobs: 1 (f=1): [r(1)][100.0%][r=436MiB/s,w=0KiB/s][r=13.9k,w=0 IOPS]
  read: IOPS=13.9k, BW=433MiB/s (454MB/s)(25.4GiB/60002msec)
  lat (usec): min=343, max=16319, avg=1152.52, stdev=245.36
------------------------------------------------------------------------
With the patch, 2 HDDs:
Jobs: 1 (f=1): [r(1)][100.0%][r=17.5MiB/s,w=0KiB/s][r=560,w=0 IOPS]
  read: IOPS=560, BW=17.5MiB/s (18.4MB/s)(1053MiB/60052msec)
  lat (usec): min=435, max=341037, avg=28511.64, stdev=30000.14
------------------------------------------------------------------------
With the patch, HDD(old one)+SSD:
Jobs: 1 (f=1): [r(1)][100.0%][r=371MiB/s,w=0KiB/s][r=11.9k,w=0 IOPS]
  read: IOPS=11.6k, BW=361MiB/s (379MB/s)(21.2GiB/60084msec)
  lat  (usec): min=363, max=346752, avg=1381.73, stdev=6948.32

Changes:
  v1 -> v2:
    - Use helper part_in_flight() from genhd.c
      to get queue length
    - Move guess code to guess_optimal()
    - Change balancer logic, try use pid % mirror by default
      Make balancing on spinning rust if one of underline devices
      are overloaded
  v2 -> v3:
    - Fix arg for RAID10 - use sub_stripes, instead of num_stripes
  v3 -> v4:
    - Rebase on latest misc-next
  v4 -> v5:
    - Rebase on latest misc-next
  v5 -> v6:
    - Fix spelling
    - Include bench results
  v6 -> v7:
    - Fixes based on Nikolay Borisov review:
      * Assume num == 2
      * Remove "for" loop based on that assumption, where possible
  v7 -> v8:
    - Add comment about magic '2' num in guess function
  v8 -> v9:
    - Rebase on latest misc-next
    - Simplify code
    - Use abs instead of round_down for approximation
      Abs are more fair

Signed-off-by: Timofey Titovets <nefelim4ag@gmail.com>
Tested-by: Dmitrii Tcvetkov <demfloro@demfloro.ru>
---
 block/genhd.c      |  1 +
 fs/btrfs/volumes.c | 88 +++++++++++++++++++++++++++++++++++++++++++++-
 2 files changed, 88 insertions(+), 1 deletion(-)

Comments

David Sterba May 13, 2019, 6:52 p.m. UTC | #1
On Mon, May 06, 2019 at 05:37:40PM +0300, Timofey Titovets wrote:
> From: Timofey Titovets <nefelim4ag@gmail.com>
> 
> Currently btrfs raid1/10 bаlance requests to mirrors,
> based on pid % num of mirrors.
> 
> Add logic to consider:
>  - If one of devices are non rotational
>  - Queue length to devices
> 
> By default pid % num_mirrors guessing will be used, but:
>  - If one of mirrors are non rotational,
>    reroute requests to non rotational
>  - If other mirror have less queue length then optimal,
>    repick to that mirror
> 
> For avoid round-robin request balancing,
> lets use abs diff of queue length,
> and only if diff greater then threshold try rebalance:
>  - threshold 8 for rotational devs
>  - threshold 2 for all non rotational devs

I started to look at this again, so far I think that the decision logic
should be time and congestion based. The case when the devices are
uncongested is easy and choosing either mirror is fine (semi-randomly).

On congested devices it can lead to some pathological patterns like
switching the mirrors too often, so this is the one I'd focus on.

The congestion is reflected in your patch by the part_in_flight, though
you call it 'queue' that got me first confused as there are the hardware
device queues and block layer queues.

From the filesystem perspective and for our purposes, anything that is
below the fs layer is in-flight.

The requests to select a mirror come from all places (btrfs_map_block)
and are basically unpredictable, so saving position of disk head would
not help to decide.

The balancing mechanism IMHO can rely only on averaging all requests and
doing round robin so that all mirrors are evenly loaded. If there's a
faster device in the mix, it'll process the IO faster and become less
loaded so it'll get more requests in turn.

I'd rather see a generic mechanism that would spread the load than doing
static checks if the device is or is not rotational. Though this is a
significant factor, it's not 100% reliable, eg. NBD or iscsi or other
layers between the physical device and what the fs sees.

So, the time based switching applies last, when the devices are
congested, evenly loaded by previous requests and further io needs to
maintain some balance.

Once this state is detected, time is saved and all new requests go to
one mirror. After a time passes, the mirrors are switched.

Chosing the right time is the question, starting with 1 second but not
than 5 is my first estimate.

Regarding the patches to select mirror policy, that Anand sent, I think
we first should provide a sane default policy that addresses most
commong workloads before we offer an interface for users that see the
need to fiddle with it.

> 
> Some bench results from mail list
> (Dmitrii Tcvetkov <demfloro@demfloro.ru>):
> Benchmark summary (arithmetic mean of 3 runs):
>          Mainline     Patch
> ------------------------------------
> RAID1  | 18.9 MiB/s | 26.5 MiB/s
> RAID10 | 30.7 MiB/s | 30.7 MiB/s
> ------------------------------------------------------------------------
> mainline, fio got lucky to read from first HDD (quite slow HDD):
> Jobs: 1 (f=1): [r(1)][100.0%][r=8456KiB/s,w=0KiB/s][r=264,w=0 IOPS]
>   read: IOPS=265, BW=8508KiB/s (8712kB/s)(499MiB/60070msec)
>   lat (msec): min=2, max=825, avg=60.17, stdev=65.06
> ------------------------------------------------------------------------
> mainline, fio got lucky to read from second HDD (much more modern):
> Jobs: 1 (f=1): [r(1)][8.7%][r=11.9MiB/s,w=0KiB/s][r=380,w=0 IOPS]
>   read: IOPS=378, BW=11.8MiB/s (12.4MB/s)(710MiB/60051msec)
>   lat (usec): min=416, max=644286, avg=42312.74, stdev=48518.56
> ------------------------------------------------------------------------
> mainline, fio got lucky to read from an SSD:
> Jobs: 1 (f=1): [r(1)][100.0%][r=436MiB/s,w=0KiB/s][r=13.9k,w=0 IOPS]
>   read: IOPS=13.9k, BW=433MiB/s (454MB/s)(25.4GiB/60002msec)
>   lat (usec): min=343, max=16319, avg=1152.52, stdev=245.36
> ------------------------------------------------------------------------
> With the patch, 2 HDDs:
> Jobs: 1 (f=1): [r(1)][100.0%][r=17.5MiB/s,w=0KiB/s][r=560,w=0 IOPS]
>   read: IOPS=560, BW=17.5MiB/s (18.4MB/s)(1053MiB/60052msec)
>   lat (usec): min=435, max=341037, avg=28511.64, stdev=30000.14
> ------------------------------------------------------------------------
> With the patch, HDD(old one)+SSD:
> Jobs: 1 (f=1): [r(1)][100.0%][r=371MiB/s,w=0KiB/s][r=11.9k,w=0 IOPS]
>   read: IOPS=11.6k, BW=361MiB/s (379MB/s)(21.2GiB/60084msec)
>   lat  (usec): min=363, max=346752, avg=1381.73, stdev=6948.32
> 
> Changes:
>   v1 -> v2:
>     - Use helper part_in_flight() from genhd.c
>       to get queue length
>     - Move guess code to guess_optimal()
>     - Change balancer logic, try use pid % mirror by default
>       Make balancing on spinning rust if one of underline devices
>       are overloaded
>   v2 -> v3:
>     - Fix arg for RAID10 - use sub_stripes, instead of num_stripes
>   v3 -> v4:
>     - Rebase on latest misc-next
>   v4 -> v5:
>     - Rebase on latest misc-next
>   v5 -> v6:
>     - Fix spelling
>     - Include bench results
>   v6 -> v7:
>     - Fixes based on Nikolay Borisov review:
>       * Assume num == 2
>       * Remove "for" loop based on that assumption, where possible
>   v7 -> v8:
>     - Add comment about magic '2' num in guess function
>   v8 -> v9:
>     - Rebase on latest misc-next
>     - Simplify code
>     - Use abs instead of round_down for approximation
>       Abs are more fair
> 
> Signed-off-by: Timofey Titovets <nefelim4ag@gmail.com>
> Tested-by: Dmitrii Tcvetkov <demfloro@demfloro.ru>
> ---
>  block/genhd.c      |  1 +
>  fs/btrfs/volumes.c | 88 +++++++++++++++++++++++++++++++++++++++++++++-
>  2 files changed, 88 insertions(+), 1 deletion(-)
> 
> diff --git a/block/genhd.c b/block/genhd.c
> index 703267865f14..fb35c85a7f42 100644
> --- a/block/genhd.c
> +++ b/block/genhd.c
> @@ -84,6 +84,7 @@ unsigned int part_in_flight(struct request_queue *q, struct hd_struct *part)
>  
>  	return inflight;
>  }
> +EXPORT_SYMBOL_GPL(part_in_flight);

This needs to be acked by block layer people, unless we find another way
to get the status of the devic regarding congestion. There is
bdi_congested but it's maybe to coarse for the decisions which device is
more congested.

Alternatively we can keep the stats inside btrfs_device.
waxhead May 13, 2019, 7:57 p.m. UTC | #2
David Sterba wrote:
> On Mon, May 06, 2019 at 05:37:40PM +0300, Timofey Titovets wrote:
>> From: Timofey Titovets <nefelim4ag@gmail.com>
>>
>> Currently btrfs raid1/10 bаlance requests to mirrors,
>> based on pid % num of mirrors.
>>
> 
> Regarding the patches to select mirror policy, that Anand sent, I think
> we first should provide a sane default policy that addresses most
> commong workloads before we offer an interface for users that see the
> need to fiddle with it.
> 
As just a regular btrfs user I would just like to add that I earlier 
made a comment where I think that btrfs should have the ability to 
assign certain DevID's to groups (storage device groups).

 From there I think it would be a good idea to "assign" subvolumes to 
either one (or more) group(s) so that btrfs would prefer (if free space 
permits) to store data from that subvolume on a certain group of storage 
devices.

If you could also set a weight value for read and write separately for a 
group then you are from a humble users point of view good to go and any 
PID% optimization (and management) while very interesting sounds less 
important.

As BTRFS scales to more than 32 devices (I think there is a limit for 30 
or 32????) device groups should really be in there from a management 
point of view and mount options for readmirror policy does not sound 
good the way I understand it as this would affect the fileystem globally.

Groups could also allow for useful features like making sure metadata 
stays on fast devices, migrating hot data to faster groups automatically 
on read, and when (if?) subvolumes support different storage profiles 
"Raid1/10/5/6" it sounds like an even better idea to assign such 
subvolumes to faster/slower groups depending on the storage profile.

Anyway... I just felt like airing some ideas since the readmirror topic 
has come up a few times on the mailing list recently.
Hugo Mills May 13, 2019, 8:03 p.m. UTC | #3
On Mon, May 13, 2019 at 09:57:43PM +0200, waxhead wrote:
> David Sterba wrote:
> >On Mon, May 06, 2019 at 05:37:40PM +0300, Timofey Titovets wrote:
> >>From: Timofey Titovets <nefelim4ag@gmail.com>
> >>
> >>Currently btrfs raid1/10 bаlance requests to mirrors,
> >>based on pid % num of mirrors.
> >>
> >
> >Regarding the patches to select mirror policy, that Anand sent, I think
> >we first should provide a sane default policy that addresses most
> >commong workloads before we offer an interface for users that see the
> >need to fiddle with it.
> >
> As just a regular btrfs user I would just like to add that I earlier
> made a comment where I think that btrfs should have the ability to
> assign certain DevID's to groups (storage device groups).
> 
> From there I think it would be a good idea to "assign" subvolumes to
> either one (or more) group(s) so that btrfs would prefer (if free
> space permits) to store data from that subvolume on a certain group
> of storage devices.
> 
> If you could also set a weight value for read and write separately
> for a group then you are from a humble users point of view good to
> go and any PID% optimization (and management) while very interesting
> sounds less important.
> 
> As BTRFS scales to more than 32 devices (I think there is a limit
> for 30 or 32????) device groups should really be in there from a
> management point of view and mount options for readmirror policy
> does not sound good the way I understand it as this would affect the
> fileystem globally.
> 
> Groups could also allow for useful features like making sure
> metadata stays on fast devices, migrating hot data to faster groups
> automatically on read, and when (if?) subvolumes support different
> storage profiles "Raid1/10/5/6" it sounds like an even better idea
> to assign such subvolumes to faster/slower groups depending on the
> storage profile.
> 
> Anyway... I just felt like airing some ideas since the readmirror
> topic has come up a few times on the mailing list recently.

   I did write up a slightly more concrete proposal on how to do this
algorithmically (plus quite a lot more) some years ago. I even started
implementing it, but I ran into problems of available time and
available kernel mad skillz, neither of which I had enough of.

https://www.spinics.net/lists/linux-btrfs/msg33916.html

   Hugo.
Steven Davies May 14, 2019, 9:11 p.m. UTC | #4
On 2019-05-13 19:52, David Sterba wrote:
> I'd rather see a generic mechanism that would spread the load than 
> doing
> static checks if the device is or is not rotational. Though this is a
> significant factor, it's not 100% reliable, eg. NBD or iscsi or other
> layers between the physical device and what the fs sees.

Is there perhaps a way to measure, or access existing statistics for, 
average read times for each device, and use that to hint at suggested 
devices to read from?

> Regarding the patches to select mirror policy, that Anand sent, I think
> we first should provide a sane default policy that addresses most
> commong workloads before we offer an interface for users that see the
> need to fiddle with it.

Is it necessary to let users choose a (read) mirror policy if the 
options follow a logical progression? For now we could say:

choose_mirror():
   read_mirrors = list_available_read_mirrors() (Anand's patch series)
   if read_mirrors.length == 1 //User specified only one device to read 
from
     return read_mirrors[0]
   if read_mirrors is empty
     read_mirrors = all_available_mirrors()
   return fastest_mirror_from(read_mirrors) (Timofey's patch)

Any other algorithms can be added into the stack in a logical position. 
Perhaps if it gets more complicated then there can be an fs 
property/mount option/tunable to disable one or more algorithms.

Patch
diff mbox series

diff --git a/block/genhd.c b/block/genhd.c
index 703267865f14..fb35c85a7f42 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -84,6 +84,7 @@  unsigned int part_in_flight(struct request_queue *q, struct hd_struct *part)
 
 	return inflight;
 }
+EXPORT_SYMBOL_GPL(part_in_flight);
 
 void part_in_flight_rw(struct request_queue *q, struct hd_struct *part,
 		       unsigned int inflight[2])
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 1c2a6e4b39da..8671c2bdced6 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -13,6 +13,7 @@ 
 #include <linux/raid/pq.h>
 #include <linux/semaphore.h>
 #include <linux/uuid.h>
+#include <linux/genhd.h>
 #include <linux/list_sort.h>
 #include "ctree.h"
 #include "extent_map.h"
@@ -29,6 +30,8 @@ 
 #include "sysfs.h"
 #include "tree-checker.h"
 
+#define BTRFS_RAID_1_10_MAX_MIRRORS 2
+
 const struct btrfs_raid_attr btrfs_raid_array[BTRFS_NR_RAID_TYPES] = {
 	[BTRFS_RAID_RAID10] = {
 		.sub_stripes	= 2,
@@ -5482,6 +5485,88 @@  int btrfs_is_parity_mirror(struct btrfs_fs_info *fs_info, u64 logical, u64 len)
 	return ret;
 }
 
+/**
+ * bdev_get_queue_len - return rounded down in flight queue length of bdev
+ *
+ * @bdev: target bdev
+ */
+static uint32_t bdev_get_queue_len(struct block_device *bdev)
+{
+	struct hd_struct *bd_part = bdev->bd_part;
+	struct request_queue *rq = bdev_get_queue(bdev);
+
+	return part_in_flight(rq, bd_part);
+}
+
+/**
+ * guess_optimal - return guessed optimal mirror
+ *
+ * Optimal expected to be pid % num_stripes
+ *
+ * That's generally fine for spread load
+ * That are balancer based on queue length to device
+ *
+ * Basic ideas:
+ *  - Sequential read generate low amount of request
+ *    so if load of drives are equal, use pid % num_stripes balancing
+ *  - For mixed rotate/non-rotate mirrors, pick non-rotate as optimal
+ *    and repick if other dev have "significant" less queue length
+ *  - Repick optimal if queue length of other mirror are significantly less
+ */
+static int guess_optimal(struct map_lookup *map, int num, int optimal)
+{
+	int i;
+	/*
+	 * Spinning rust not like random request
+	 * Because of that rebalance are less aggressive
+	 */
+	uint32_t rq_len_overload = 8;
+	uint32_t qlen[2];
+	bool is_nonrot[2];
+	struct block_device *bdev;
+
+	/* That function supposed to work with up to 2 mirrors */
+	ASSERT(BTRFS_RAID_1_10_MAX_MIRRORS == 2);
+	ASSERT(BTRFS_RAID_1_10_MAX_MIRRORS == num);
+
+	/* Check accessible bdevs */
+	for (i = 0; i < 2; i++) {
+		bdev = map->stripes[i].dev->bdev;
+		/*
+		 * Don't bother with further computation
+		 * if only one of two bdevs are accessible
+		 */
+		if (!bdev)
+			return (i + 1) % 2;
+
+		qlen[i] = bdev_get_queue_len(bdev);
+		is_nonrot[i] = blk_queue_nonrot(bdev_get_queue(bdev));
+	}
+
+	/* For mixed case, pick non rotational dev as optimal */
+	if (is_nonrot[0] != is_nonrot[1]) {
+		if (is_nonrot[0])
+			optimal = 0;
+		else
+			optimal = 1;
+	} else {
+		/* For nonrot devices balancing can be aggressive */
+		if (is_nonrot[0])
+			rq_len_overload = 2;
+	}
+
+	/* Devices load at same level */
+	if (abs(qlen[0] - qlen[1]) <= rq_len_overload)
+		return optimal;
+
+	if (qlen[0] > qlen[1])
+		optimal = 1;
+	else
+		optimal = 0;
+
+	return optimal;
+}
+
 static int find_live_mirror(struct btrfs_fs_info *fs_info,
 			    struct map_lookup *map, int first,
 			    int dev_replace_is_ongoing)
@@ -5500,7 +5585,8 @@  static int find_live_mirror(struct btrfs_fs_info *fs_info,
 	else
 		num_stripes = map->num_stripes;
 
-	preferred_mirror = first + current->pid % num_stripes;
+	preferred_mirror = first + guess_optimal(map, num_stripes,
+						 current->pid % num_stripes);
 
 	if (dev_replace_is_ongoing &&
 	    fs_info->dev_replace.cont_reading_from_srcdev_mode ==