
[RFC,0/6] Add roundrobin raid1 read policy

Message ID 20210209203041.21493-1-mrostecki@suse.de (mailing list archive)

Message

Michal Rostecki Feb. 9, 2021, 8:30 p.m. UTC
From: Michal Rostecki <mrostecki@suse.com>

This patch series adds a new raid1 read policy - roundrobin. For each
request, it selects the mirror whose load is lower than its queue depth.
Load is defined as the number of inflight requests plus a penalty value
(added if the scheduled request is not local to the last processed
request on a rotational disk).
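
For illustration, the selection boils down to roughly the following
sketch (the helper and member names here are illustrative, not the
exact code from the patches):

  /*
   * Sketch of the roundrobin selection described above; helper and
   * member names are assumptions, not the exact code from the series.
   */
  #define RR_NONLOCAL_PENALTY 2   /* example penalty value */

  static s64 rr_mirror_load(struct btrfs_device *dev, u64 physical)
  {
          struct request_queue *q = bdev_get_queue(dev->bdev);
          s64 load = percpu_counter_sum_positive(&dev->inflight);

          /* Non-local request on a rotational disk: add the penalty. */
          if (!blk_queue_nonrot(q) &&
              physical != atomic64_read(&dev->last_offset))
                  load += RR_NONLOCAL_PENALTY;

          return load;
  }

  /* Pick the first mirror whose load is below its queue depth. */
  static int rr_select_mirror(struct map_lookup *map, int first_stripe,
                              int num_stripes, u64 physical)
  {
          int i;

          for (i = 0; i < num_stripes; i++) {
                  struct btrfs_device *dev =
                          map->stripes[first_stripe + i].dev;
                  struct request_queue *q = bdev_get_queue(dev->bdev);

                  if (rr_mirror_load(dev, physical) < blk_queue_depth(q))
                          return first_stripe + i;
          }

          return first_stripe;
  }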

The series consists of preparatory changes, which add the necessary
information to the btrfs_device struct, followed by the patch that adds
the policy itself.

This policy was tested with fio and compared with the default `pid`
policy.

The singlethreaded test has the following parameters:

  [global]
  name=btrfs-raid1-seqread
  filename=btrfs-raid1-seqread
  rw=read
  bs=64k
  direct=0
  numjobs=1
  time_based=0

  [file1]
  size=10G
  ioengine=libaio

and shows the following results:

- raid1c3 with 3 HDDs:
  3 x Seagate Barracuda ST2000DM008 (2TB)
  * pid policy
    READ: bw=217MiB/s (228MB/s), 217MiB/s-217MiB/s (228MB/s-228MB/s),
    io=10.0GiB (10.7GB), run=47082-47082msec
  * roundrobin policy
    READ: bw=409MiB/s (429MB/s), 409MiB/s-409MiB/s (429MB/s-429MB/s),
    io=10.0GiB (10.7GB), run=25028-25028msec
- raid1c3 with 2 HDDs and 1 SSD:
  2 x Seagate Barracuda ST2000DM008 (2TB)
  1 x Crucial CT256M550SSD1 (256GB)
  * pid policy (the worst case when only HDDs were chosen)
    READ: bw=220MiB/s (231MB/s), 220MiB/s-220MiB/s (231MB/s-231MB/s),
    io=10.0GiB (10.7GB), run=46577-46577msec
  * pid policy (the best case when SSD was used as well)
    READ: bw=513MiB/s (538MB/s), 513MiB/s-513MiB/s (538MB/s-538MB/s),
    io=10.0GiB (10.7GB), run=19954-19954msec
  * roundrobin (there are no noticeable differences when testing multiple
    times)
    READ: bw=541MiB/s (567MB/s), 541MiB/s-541MiB/s (567MB/s-567MB/s),
    io=10.0GiB (10.7GB), run=18933-18933msec

The multithreaded test has the following parameters:

  [global]
  name=btrfs-raid1-seqread
  filename=btrfs-raid1-seqread
  rw=read
  bs=64k
  direct=0
  numjobs=8
  time_based=0

  [file1]
  size=10G
  ioengine=libaio

and shows the following results:

- raid1c3 with 3 HDDs:
  3 x Seagate Barracuda ST2000DM008 (2TB)
  * pid policy
    READ: bw=1569MiB/s (1645MB/s), 196MiB/s-196MiB/s (206MB/s-206MB/s),
    io=80.0GiB (85.9GB), run=52210-52211msec
  * roundrobin
    READ: bw=1733MiB/s (1817MB/s), 217MiB/s-217MiB/s (227MB/s-227MB/s),
    io=80.0GiB (85.9GB), run=47269-47271msec
- raid1c3 with 2 HDDs and 1 SSD:
  2 x Seagate Barracuda ST2000DM008 (2TB)
  1 x Crucial CT256M550SSD1 (256GB)
  * pid policy
    READ: bw=1843MiB/s (1932MB/s), 230MiB/s-230MiB/s (242MB/s-242MB/s),
    io=80.0GiB (85.9GB), run=44449-44450msec
  * roundrobin
    READ: bw=2485MiB/s (2605MB/s), 311MiB/s-311MiB/s (326MB/s-326MB/s),
    io=80.0GiB (85.9GB), run=32969-32970msec

To measure the performance of each policy and find optimal penalty
values, I created scripts which are available here:

https://gitlab.com/vadorovsky/btrfs-perf
https://github.com/mrostecki/btrfs-perf

Michal Rostecki (6):
  btrfs: Add inflight BIO request counter
  btrfs: Store the last device I/O offset
  btrfs: Add stripe_physical function
  btrfs: Check if the filesystem has mixed types of devices
  btrfs: sysfs: Add directory for read policies
  btrfs: Add roundrobin raid1 read policy

 fs/btrfs/ctree.h   |   3 +
 fs/btrfs/disk-io.c |   3 +
 fs/btrfs/sysfs.c   | 144 ++++++++++++++++++++++++++----
 fs/btrfs/volumes.c | 218 +++++++++++++++++++++++++++++++++++++++++++--
 fs/btrfs/volumes.h |  22 +++++
 5 files changed, 366 insertions(+), 24 deletions(-)

Comments

Anand Jain Feb. 10, 2021, 6:52 a.m. UTC | #1
On 10/02/2021 04:30, Michal Rostecki wrote:
> From: Michal Rostecki <mrostecki@suse.com>
> 
> This patch series adds a new raid1 read policy - roundrobin. For each
> request, it selects the mirror which has lower load than queue depth.
> Load is defined  as the number of inflight requests + a penalty value
> (if the scheduled request is not local to the last processed request for
> a rotational disk).
> 
> The series consists of preparational changes which add necessary
> information to the btrfs_device struct and the change with the policy.
> 
> This policy was tested with fio and compared with the default `pid`
> policy.
> 
> The singlethreaded test has the following parameters:
> 
>    [global]
>    name=btrfs-raid1-seqread
>    filename=btrfs-raid1-seqread
>    rw=read
>    bs=64k
>    direct=0
>    numjobs=1
>    time_based=0
> 
>    [file1]
>    size=10G
>    ioengine=libaio
> 
> and shows the following results:
> 
> - raid1c3 with 3 HDDs:
>    3 x Seagate Barracuda ST2000DM008 (2TB)
>    * pid policy
>      READ: bw=217MiB/s (228MB/s), 217MiB/s-217MiB/s (228MB/s-228MB/s),
>      io=10.0GiB (10.7GB), run=47082-47082msec
>    * roundrobin policy
>      READ: bw=409MiB/s (429MB/s), 409MiB/s-409MiB/s (429MB/s-429MB/s),
>      io=10.0GiB (10.7GB), run=25028-25028msec


> - raid1c3 with 2 HDDs and 1 SSD:
>    2 x Seagate Barracuda ST2000DM008 (2TB)
>    1 x Crucial CT256M550SSD1 (256GB)
>    * pid policy (the worst case when only HDDs were chosen)
>      READ: bw=220MiB/s (231MB/s), 220MiB/s-220MiB/s (231MB/s-231MB/s),
>      io=10.0GiB (10.7GB), run=46577-46577msec
>    * pid policy (the best case when SSD was used as well)
>      READ: bw=513MiB/s (538MB/s), 513MiB/s-513MiB/s (538MB/s-538MB/s),
>      io=10.0GiB (10.7GB), run=19954-19954msec
>    * roundrobin (there are no noticeable differences when testing multiple
>      times)
>      READ: bw=541MiB/s (567MB/s), 541MiB/s-541MiB/s (567MB/s-567MB/s),
>      io=10.0GiB (10.7GB), run=18933-18933msec
> 
> The multithreaded test has the following parameters:
> 
>    [global]
>    name=btrfs-raid1-seqread
>    filename=btrfs-raid1-seqread
>    rw=read
>    bs=64k
>    direct=0
>    numjobs=8
>    time_based=0
> 
>    [file1]
>    size=10G
>    ioengine=libaio
> 
> and shows the following results:
> 
> - raid1c3 with 3 HDDs:
>    3 x Seagate Barracuda ST2000DM008 (2TB)
>    * pid policy
>      READ: bw=1569MiB/s (1645MB/s), 196MiB/s-196MiB/s (206MB/s-206MB/s),
>      io=80.0GiB (85.9GB), run=52210-52211msec
>    * roundrobin
>      READ: bw=1733MiB/s (1817MB/s), 217MiB/s-217MiB/s (227MB/s-227MB/s),
>      io=80.0GiB (85.9GB), run=47269-47271msec
> - raid1c3 with 2 HDDs and 1 SSD:
>    2 x Seagate Barracuda ST2000DM008 (2TB)
>    1 x Crucial CT256M550SSD1 (256GB)
>    * pid policy
>      READ: bw=1843MiB/s (1932MB/s), 230MiB/s-230MiB/s (242MB/s-242MB/s),
>      io=80.0GiB (85.9GB), run=44449-44450msec
>    * roundrobin
>      READ: bw=2485MiB/s (2605MB/s), 311MiB/s-311MiB/s (326MB/s-326MB/s),
>      io=80.0GiB (85.9GB), run=32969-32970msec
> 

  Both of the above test cases are sequential. How about some random IO
  workload?

  Also, non-rotational devices have no seek time, so it would be a good
  idea to test with SSD + NVMe, all-NVMe, and all-SSD setups as well.

> To measure the performance of each policy and find optimal penalty
> values, I created scripts which are available here:
> 
> https://gitlab.com/vadorovsky/btrfs-perf
> https://github.com/mrostecki/btrfs-perf
> 
> Michal Rostecki (6):


>    btrfs: Add inflight BIO request counter
>    btrfs: Store the last device I/O offset

These patches look good. But only the round-robin policy requires
monitoring the inflight requests and the last offset. Could you bring
them under if policy == roundrobin? Otherwise it is just a waste of CPU
cycles when the policy != roundrobin.
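
Something along these lines is what I mean - a rough sketch only, the
enum value and member names below are placeholders:

  static void btrfs_rr_account_bio(struct btrfs_fs_devices *fs_devices,
                                   struct btrfs_device *dev, u64 physical)
  {
          /* Only pay the bookkeeping cost for the roundrobin policy. */
          if (READ_ONCE(fs_devices->read_policy) != BTRFS_READ_POLICY_RR)
                  return;

          percpu_counter_inc(&dev->inflight);
          atomic64_set(&dev->last_offset, physical);
  }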

 >    btrfs: Add stripe_physical function
 >    btrfs: Check if the filesystem has mixed types of devices

Thanks, Anand

>    btrfs: sysfs: Add directory for read policies
>    btrfs: Add roundrobin raid1 read policy
> 
>   fs/btrfs/ctree.h   |   3 +
>   fs/btrfs/disk-io.c |   3 +
>   fs/btrfs/sysfs.c   | 144 ++++++++++++++++++++++++++----
>   fs/btrfs/volumes.c | 218 +++++++++++++++++++++++++++++++++++++++++++--
>   fs/btrfs/volumes.h |  22 +++++
>   5 files changed, 366 insertions(+), 24 deletions(-)
>
Michal Rostecki Feb. 10, 2021, 12:18 p.m. UTC | #2
On Wed, Feb 10, 2021 at 02:52:01PM +0800, Anand Jain wrote:
> On 10/02/2021 04:30, Michal Rostecki wrote:
> > From: Michal Rostecki <mrostecki@suse.com>
> > 
> > This patch series adds a new raid1 read policy - roundrobin. For each
> > request, it selects the mirror which has lower load than queue depth.
> > Load is defined  as the number of inflight requests + a penalty value
> > (if the scheduled request is not local to the last processed request for
> > a rotational disk).
> > 
> > The series consists of preparational changes which add necessary
> > information to the btrfs_device struct and the change with the policy.
> > 
> > This policy was tested with fio and compared with the default `pid`
> > policy.
> > 
> > The singlethreaded test has the following parameters:
> > 
> >    [global]
> >    name=btrfs-raid1-seqread
> >    filename=btrfs-raid1-seqread
> >    rw=read
> >    bs=64k
> >    direct=0
> >    numjobs=1
> >    time_based=0
> > 
> >    [file1]
> >    size=10G
> >    ioengine=libaio
> > 
> > and shows the following results:
> > 
> > - raid1c3 with 3 HDDs:
> >    3 x Seagate Barracuda ST2000DM008 (2TB)
> >    * pid policy
> >      READ: bw=217MiB/s (228MB/s), 217MiB/s-217MiB/s (228MB/s-228MB/s),
> >      io=10.0GiB (10.7GB), run=47082-47082msec
> >    * roundrobin policy
> >      READ: bw=409MiB/s (429MB/s), 409MiB/s-409MiB/s (429MB/s-429MB/s),
> >      io=10.0GiB (10.7GB), run=25028-25028msec
> 
> 
> > - raid1c3 with 2 HDDs and 1 SSD:
> >    2 x Seagate Barracuda ST2000DM008 (2TB)
> >    1 x Crucial CT256M550SSD1 (256GB)
> >    * pid policy (the worst case when only HDDs were chosen)
> >      READ: bw=220MiB/s (231MB/s), 220MiB/s-220MiB/s (231MB/s-231MB/s),
> >      io=10.0GiB (10.7GB), run=46577-46577msec
> >    * pid policy (the best case when SSD was used as well)
> >      READ: bw=513MiB/s (538MB/s), 513MiB/s-513MiB/s (538MB/s-538MB/s),
> >      io=10.0GiB (10.7GB), run=19954-19954msec
> >    * roundrobin (there are no noticeable differences when testing multiple
> >      times)
> >      READ: bw=541MiB/s (567MB/s), 541MiB/s-541MiB/s (567MB/s-567MB/s),
> >      io=10.0GiB (10.7GB), run=18933-18933msec
> > 
> > The multithreaded test has the following parameters:
> > 
> >    [global]
> >    name=btrfs-raid1-seqread
> >    filename=btrfs-raid1-seqread
> >    rw=read
> >    bs=64k
> >    direct=0
> >    numjobs=8
> >    time_based=0
> > 
> >    [file1]
> >    size=10G
> >    ioengine=libaio
> > 
> > and shows the following results:
> > 
> > - raid1c3 with 3 HDDs:
> >    3 x Seagate Barracuda ST2000DM008 (2TB)
> >    * pid policy
> >      READ: bw=1569MiB/s (1645MB/s), 196MiB/s-196MiB/s (206MB/s-206MB/s),
> >      io=80.0GiB (85.9GB), run=52210-52211msec
> >    * roundrobin
> >      READ: bw=1733MiB/s (1817MB/s), 217MiB/s-217MiB/s (227MB/s-227MB/s),
> >      io=80.0GiB (85.9GB), run=47269-47271msec
> > - raid1c3 with 2 HDDs and 1 SSD:
> >    2 x Seagate Barracuda ST2000DM008 (2TB)
> >    1 x Crucial CT256M550SSD1 (256GB)
> >    * pid policy
> >      READ: bw=1843MiB/s (1932MB/s), 230MiB/s-230MiB/s (242MB/s-242MB/s),
> >      io=80.0GiB (85.9GB), run=44449-44450msec
> >    * roundrobin
> >      READ: bw=2485MiB/s (2605MB/s), 311MiB/s-311MiB/s (326MB/s-326MB/s),
> >      io=80.0GiB (85.9GB), run=32969-32970msec
> > 
> 
>  Both of the above test cases are sequential. How about some random IO
>  workload?
> 
>  Also, the seek time for non-rotational devices does not exist. So it is
>  a good idea to test with ssd + nvme and all nvme or all ssd.
> 

Good idea. I will test random I/O and will also try all-NVMe, all-SSD,
and mixed non-rotational setups.

> > To measure the performance of each policy and find optimal penalty
> > values, I created scripts which are available here:
> > 
> > https://gitlab.com/vadorovsky/btrfs-perf
> > https://github.com/mrostecki/btrfs-perf
> > 
> > Michal Rostecki (6):
> 
> 
> >    btrfs: Add inflight BIO request counter
> >    btrfs: Store the last device I/O offset
> 
> These patches look good. But as only round-robin policy requires
> to monitor the inflight and last-offset. Could you bring them under
> if policy=roundrobin? Otherwise, it is just a waste of CPU cycles
> if the policy != roundrobin.
> 

If I gather those stats only under if policy == roundrobin, they are
going to be inaccurate if someone switches policies on a running system
after doing any I/O on that filesystem.

I'm open to suggestions on how I can make those stats as lightweight as
possible. Unfortunately, I don't think I can store the last physical
location without an atomic_t.

The BIO percpu counter is probably the least to be worried about, though
I could maybe get rid of it entirely in favor of using part_stat_read().
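
For reference, the bookkeeping being discussed is roughly the following
(a sketch with assumed member names, not the exact code from the series):

  static void rr_note_bio_start(struct btrfs_device *dev, u64 physical,
                                u32 len)
  {
          percpu_counter_inc(&dev->inflight);
          /* A plain u64 store could race between CPUs, hence atomic64. */
          atomic64_set(&dev->last_offset, physical + len);
  }

  static void rr_note_bio_end(struct btrfs_device *dev)
  {
          percpu_counter_dec(&dev->inflight);
  }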

> >    btrfs: Add stripe_physical function
> >    btrfs: Check if the filesystem has mixed types of devices
> 
> Thanks, Anand
> 
> >    btrfs: sysfs: Add directory for read policies
> >    btrfs: Add roundrobin raid1 read policy
> > 
> >   fs/btrfs/ctree.h   |   3 +
> >   fs/btrfs/disk-io.c |   3 +
> >   fs/btrfs/sysfs.c   | 144 ++++++++++++++++++++++++++----
> >   fs/btrfs/volumes.c | 218 +++++++++++++++++++++++++++++++++++++++++++--
> >   fs/btrfs/volumes.h |  22 +++++
> >   5 files changed, 366 insertions(+), 24 deletions(-)
> > 
> 

Thanks for review,
Michal
Michal Rostecki Feb. 10, 2021, 2 p.m. UTC | #3
On Wed, Feb 10, 2021 at 12:18:53PM +0000, Michal Rostecki wrote:
> > These patches look good. But as only round-robin policy requires
> > to monitor the inflight and last-offset. Could you bring them under
> > if policy=roundrobin? Otherwise, it is just a waste of CPU cycles
> > if the policy != roundrobin.
> > 
> 
> If I bring those stats under if policy=roundrobin, they are going to be
> inaccurate if someone switches policies on the running system, after
> doing any I/O in that filesystem.
> 
> I'm open to suggestions how can I make those stats as lightweight as
> possible. Unfortunately, I don't think I can store the last physical
> location without atomic_t.
> 
> The BIO percpu counter is probably the least to be worried about, though
> I could maybe get rid of it entirely in favor of using part_stat_read().
> 

Actually, after thinking about that more, I'm wondering if I should just
drop the last-offset stat and penalty mechanism entirely. They seem to
improve performance slightly, and only on mixed workloads (though I need
to check whether that's the case on all-SSD or all-NVMe arrays), but they
still perform worse than the policies that you proposed.

Maybe it'd be better if I just focus on getting the best performance on
non-mixed environments in my policy and thus stick to the simple
`inflight < queue_depth` check...
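
In that case the per-mirror test would reduce to roughly this (sketch
only, with the same assumed member names as before):

  if (percpu_counter_sum_positive(&dev->inflight) <
      blk_queue_depth(bdev_get_queue(dev->bdev)))
          return mirror;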

Thanks,
Michal