diff mbox series

btrfs: balance RAID1/RAID10 mirror selection

Message ID 8541d6d7a63e470b9f4c22ba95cd64fc@waffle.tech (mailing list archive)
State New, archived
Series: btrfs: balance RAID1/RAID10 mirror selection

Commit Message

louis@waffle.tech Oct. 16, 2020, 5:59 a.m. UTC
Balance RAID1/RAID10 mirror selection via plain round-robin scheduling. This should roughly double throughput for large reads.

Signed-off-by: Louis Jencka <louis@waffle.tech>
---
 fs/btrfs/volumes.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

Comments

Nikolay Borisov Oct. 16, 2020, 7:15 a.m. UTC | #1
On 16.10.20 г. 8:59 ч., louis@waffle.tech wrote:
> Balance RAID1/RAID10 mirror selection via plain round-robin scheduling. This should roughly double throughput for large reads.
> 
> Signed-off-by: Louis Jencka <louis@waffle.tech>

Can you show numbers substantiating your claims?
Qu Wenruo Oct. 16, 2020, 7:29 a.m. UTC | #2
On 2020/10/16 下午3:15, Nikolay Borisov wrote:
>
>
> On 16.10.20 г. 8:59 ч., louis@waffle.tech wrote:
>> Balance RAID1/RAID10 mirror selection via plain round-robin scheduling. This should roughly double throughput for large reads.
>>
>> Signed-off-by: Louis Jencka <louis@waffle.tech>
>
> Can you show numbers substantiating your claims?
>
And isn't this related to the read policy? IIRC some patches are
floating in the mail list to provide the framework of how the read
policy should work.

Those patches aren't yet merged?

Thanks,
Qu
louis@waffle.tech Oct. 16, 2020, 4:18 p.m. UTC | #3
October 16, 2020 1:15 AM, "Nikolay Borisov" <nborisov@suse.com> wrote:

> On 16.10.20 г. 8:59 ч., louis@waffle.tech wrote:
> 
>> Balance RAID1/RAID10 mirror selection via plain round-robin scheduling. This should roughly double
>> throughput for large reads.
>> 
>> Signed-off-by: Louis Jencka <louis@waffle.tech>
> 
> Can you show numbers substantiating your claims?

Sure thing. Below are the results from some tests I've run using a Debian 10 VM. It has two 10GiB disks attached which are each independently limited to 100MB/s total IO (using libvirt's iotune feature). They've been used to create the RAID1 volume mounted at /mnt. I've truncated the fio output to just show the high-level stats.

Reading a large file performs twice as well with the patch applied (203 MB/s vs 101 MB/s). Writing a large file, and the random read-write tests, look like they perform roughly the same to me.

Louis

---

Without patch
=============

louis@debian:/mnt$ uname -a
Linux debian 4.19.0-11-amd64 #1 SMP Debian 4.19.146-1 (2020-09-17) x86_64 GNU/Linux

louis@debian:/mnt$ btrfs fi df /mnt
Data, RAID1: total=7.00GiB, used=2.00MiB
System, RAID1: total=8.00MiB, used=16.00KiB
Metadata, RAID1: total=1.00GiB, used=688.00KiB
GlobalReserve, single: total=29.19MiB, used=352.00KiB

louis@debian:/mnt$ dd if=/dev/urandom of=/mnt/test bs=1M count=1024 conv=fdatasync
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 12.928 s, 83.1 MB/s

louis@debian:/mnt$ dd if=/mnt/test of=/dev/null
2097152+0 records in
2097152+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 10.6403 s, 101 MB/s

louis@debian:/mnt$ fio --name=random-rw --ioengine=posixaio --rw=randrw --bs=4k --numjobs=2 --size=2g --iodepth=1 --runtime=60 --time_based --end_fsync=1
random-rw: (g=0): rw=randrw, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=posixaio, iodepth=1
...
Run status group 0 (all jobs):
   READ: bw=15.8MiB/s (16.5MB/s), 7459KiB/s-8671KiB/s (7638kB/s-8879kB/s), io=964MiB (1010MB), run=61179-61179msec
  WRITE: bw=15.8MiB/s (16.6MB/s), 7490KiB/s-8682KiB/s (7670kB/s-8890kB/s), io=966MiB (1013MB), run=61179-61179msec


With patch
==========

louis@debian:/mnt$ uname -a
Linux debian 4.19.0-11-amd64 #1 SMP Debian 4.19.146-1a~test (2020-10-15) x86_64 GNU/Linux

louis@debian:/mnt$ dd if=/dev/urandom of=/mnt/test bs=1M count=1024 conv=fdatasync
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 11.8067 s, 90.9 MB/s

louis@debian:/mnt$ dd if=/mnt/test of=/dev/null
2097152+0 records in
2097152+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 5.28642 s, 203 MB/s

louis@debian:/mnt$ fio --name=random-rw --ioengine=posixaio --rw=randrw --bs=4k --numjobs=2 --size=2g --iodepth=1 --runtime=60 --time_based --end_fsync=1
random-rw: (g=0): rw=randrw, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=posixaio, iodepth=1
...
Run status group 0 (all jobs):
   READ: bw=16.5MiB/s (17.3MB/s), 8217KiB/s-8652KiB/s (8414kB/s-8860kB/s), io=1025MiB (1074MB), run=62202-62202msec
  WRITE: bw=16.5MiB/s (17.3MB/s), 8218KiB/s-8698KiB/s (8415kB/s-8907kB/s), io=1028MiB (1077MB), run=62202-62202msec
waxhead Oct. 16, 2020, 5:21 p.m. UTC | #4
louis@waffle.tech wrote:
> Balance RAID1/RAID10 mirror selection via plain round-robin scheduling. This should roughly double throughput for large reads.
> 
> Signed-off-by: Louis Jencka <louis@waffle.tech>
> ---
>   fs/btrfs/volumes.c | 6 +++++-
>   1 file changed, 5 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index 58b9c419a..45c581d46 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -333,6 +333,9 @@ struct list_head * __attribute_const__ btrfs_get_fs_uuids(void)
>   	return &fs_uuids;
>   }
>   
> +/* Used for round-robin balancing RAID1/RAID10 reads. */
> +atomic_t rr_counter = ATOMIC_INIT(0);
> +
>   /*
>    * alloc_fs_devices - allocate struct btrfs_fs_devices
>    * @fsid:		if not NULL, copy the UUID to fs_devices::fsid
> @@ -5482,7 +5485,8 @@ static int find_live_mirror(struct btrfs_fs_info *fs_info,
>   	else
>   		num_stripes = map->num_stripes;
>   
> -	preferred_mirror = first + current->pid % num_stripes;
> +	preferred_mirror = first +
> +	    (unsigned)atomic_inc_return(&rr_counter) % num_stripes;
>   
>   	if (dev_replace_is_ongoing &&
>   	    fs_info->dev_replace.cont_reading_from_srcdev_mode ==
> 

I am just a regular user of BTRFS, but the btrfs_fs_info structure 
contains a dirty_metadata_bytes field.

Assuming that writing data leads to dirty_metadata_bytes being 
non-zero, wouldn't something along these lines potentially be better?

if (fs_info->dirty_metadata_bytes)
	atomic_inc_return(&rr_counter);
preferred_mirror = first + ((unsigned)atomic_read(&rr_counter) + current->pid) % num_stripes;

My knowledge of BTRFS code is near zero and I should know better than 
posting "code" here, but I can't resist getting my point across. There 
must be a simple way to improve slightly over a regular rr scheme.
louis@waffle.tech Oct. 16, 2020, 5:28 p.m. UTC | #5
October 16, 2020 1:29 AM, "Qu Wenruo" <quwenruo.btrfs@gmx.com> wrote:

> On 2020/10/16 下午3:15, Nikolay Borisov wrote:
> 
>> On 16.10.20 г. 8:59 ч., louis@waffle.tech wrote:
>>> Balance RAID1/RAID10 mirror selection via plain round-robin scheduling. This should roughly double
>>> throughput for large reads.
>>> 
>>> Signed-off-by: Louis Jencka <louis@waffle.tech>
>> 
>> Can you show numbers substantiating your claims?
> 
> And isn't this related to the read policy? IIRC some patches are
> floating in the mail list to provide the framework of how the read
> policy should work.
> 
> Those patches aren't yet merged?
> 
> Thanks,
> Qu

Oh, I hadn't seen the read-policy framework thread. Within that, this could be
a new policy or a replacement for BTRFS_READ_POLICY_PID. Is work ongoing for
those patches?

Louis
Anand Jain Oct. 18, 2020, 8:16 a.m. UTC | #6
On 17/10/20 1:28 am, louis@waffle.tech wrote:
> October 16, 2020 1:29 AM, "Qu Wenruo" <quwenruo.btrfs@gmx.com> wrote:
> 
>> On 2020/10/16 下午3:15, Nikolay Borisov wrote:
>>
>>> On 16.10.20 г. 8:59 ч., louis@waffle.tech wrote:
>>>> Balance RAID1/RAID10 mirror selection via plain round-robin scheduling. This should roughly double
>>>> throughput for large reads.
>>>>
>>>> Signed-off-by: Louis Jencka <louis@waffle.tech>
>>>
>>> Can you show numbers substantiating your claims?
>>
>> And isn't this related to the read policy? IIRC some patches are
>> floating in the mail list to provide the framework of how the read
>> policy should work.
>>
>> Those patches aren't yet merged?
>>
>> Thanks,
>> Qu
> 
> Oh, I hadn't seen the read-policy framework thread. Within that, this could be
> a new policy or a replacement for BTRFS_READ_POLICY_PID. Is work ongoing for
> those patches?
> 

  Yes. Those patches are just a few days away; please stay tuned.

Thanks, Anand

> Louis
>
Wang Yugui Oct. 23, 2020, 7:38 a.m. UTC | #7
Hi, Louis Jencka

Can we move 'atomic_t rr_counter' into 'struct btrfs_fs_info' to
support multiple mounted btrfs filesystems?

Although the readmirror feature (read_policy sysfs) is being discussed,
round-robin would replace BTRFS_READ_POLICY_PID in most cases. Do we
still need BTRFS_READ_POLICY_PID?

Best Regards
Wang Yugui (wangyugui@e16-tech.com)
2020/10/23

> Balance RAID1/RAID10 mirror selection via plain round-robin scheduling. This should roughly double throughput for large reads.
> 
> Signed-off-by: Louis Jencka <louis@waffle.tech>
> ---
>  fs/btrfs/volumes.c | 6 +++++-
>  1 file changed, 5 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index 58b9c419a..45c581d46 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -333,6 +333,9 @@ struct list_head * __attribute_const__ btrfs_get_fs_uuids(void)
>  	return &fs_uuids;
>  }
>  
> +/* Used for round-robin balancing RAID1/RAID10 reads. */
> +atomic_t rr_counter = ATOMIC_INIT(0);
> +
>  /*
>   * alloc_fs_devices - allocate struct btrfs_fs_devices
>   * @fsid:		if not NULL, copy the UUID to fs_devices::fsid
> @@ -5482,7 +5485,8 @@ static int find_live_mirror(struct btrfs_fs_info *fs_info,
>  	else
>  		num_stripes = map->num_stripes;
>  
> -	preferred_mirror = first + current->pid % num_stripes;
> +	preferred_mirror = first +
> +	    (unsigned)atomic_inc_return(&rr_counter) % num_stripes;
>  
>  	if (dev_replace_is_ongoing &&
>  	    fs_info->dev_replace.cont_reading_from_srcdev_mode ==
Nikolay Borisov Oct. 23, 2020, 7:42 a.m. UTC | #8
On 23.10.20 г. 10:38 ч., Wang Yugui wrote:
> Hi, Louis Jencka
> 
> Can we move 'atomic_t rr_counter' into 'struct btrfs_fs_info' to
> support multiple mounted btrfs filesystem?
> 


And introduce constant cache line pings for every read. This thing needs
to be tested under load with perf to see what kind of overhead the
shared atomic_t counter adds. My hunch is this should really be a
per-cpu variable.

<snip>
Wang Yugui Oct. 27, 2020, 2:26 p.m. UTC | #9
Hi, Louis Jencka
Cc: Anand Jain

Maybe we still need BTRFS_READ_POLICY_PID because of readahead.

There is readahead inside the OS and readahead inside some disks.

Most SSD/SAS and SSD/SATA drives appear to do readahead internally,
but some SSD/NVMe drives appear not to.

BTRFS_READ_POLICY_PID cooperates with readahead better in some cases.

Best Regards
Wang Yugui (wangyugui@e16-tech.com)
2020/10/27

> Hi, Louis Jencka
> 
> Can we move 'atomic_t rr_counter' into 'struct btrfs_fs_info' to
> support multiple mounted btrfs filesystems?
> 
> Although the readmirror feature (read_policy sysfs) is being discussed,
> round-robin would replace BTRFS_READ_POLICY_PID in most cases. Do we
> still need BTRFS_READ_POLICY_PID?
> 
> Best Regards
> Wang Yugui (wangyugui@e16-tech.com)
> 2020/10/23
> 
> > Balance RAID1/RAID10 mirror selection via plain round-robin scheduling. This should roughly double throughput for large reads.
> > 
> > Signed-off-by: Louis Jencka <louis@waffle.tech>
> > ---
> >  fs/btrfs/volumes.c | 6 +++++-
> >  1 file changed, 5 insertions(+), 1 deletion(-)
> > 
> > diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> > index 58b9c419a..45c581d46 100644
> > --- a/fs/btrfs/volumes.c
> > +++ b/fs/btrfs/volumes.c
> > @@ -333,6 +333,9 @@ struct list_head * __attribute_const__ btrfs_get_fs_uuids(void)
> >  	return &fs_uuids;
> >  }
> >  
> > +/* Used for round-robin balancing RAID1/RAID10 reads. */
> > +atomic_t rr_counter = ATOMIC_INIT(0);
> > +
> >  /*
> >   * alloc_fs_devices - allocate struct btrfs_fs_devices
> >   * @fsid:		if not NULL, copy the UUID to fs_devices::fsid
> > @@ -5482,7 +5485,8 @@ static int find_live_mirror(struct btrfs_fs_info *fs_info,
> >  	else
> >  		num_stripes = map->num_stripes;
> >  
> > -	preferred_mirror = first + current->pid % num_stripes;
> > +	preferred_mirror = first +
> > +	    (unsigned)atomic_inc_return(&rr_counter) % num_stripes;
> >  
> >  	if (dev_replace_is_ongoing &&
> >  	    fs_info->dev_replace.cont_reading_from_srcdev_mode ==
>
louis@waffle.tech Nov. 2, 2020, 7:29 p.m. UTC | #10
Good points. I'll move it into 'struct btrfs_fs_info', look at making it a per-cpu variable, and do some testing with perf (later this week hopefully).

Louis

October 23, 2020 1:42 AM, "Nikolay Borisov" <nborisov@suse.com> wrote:

> On 23.10.20 г. 10:38 ч., Wang Yugui wrote:
> 
>> Hi, Louis Jencka
>> 
>> Can we move 'atomic_t rr_counter' into 'struct btrfs_fs_info' to
>> support multiple mounted btrfs filesystem?
> 
> And introduce constant cache line pings for every read. This thing needs
> to be tested under load with perf to see what kind of overhead the
> shared atomic_t counter adds. My hunch is this should really be a
> per-cpu variable.
> 
> <snip>
louis@waffle.tech Nov. 2, 2020, 7:36 p.m. UTC | #11
Ha, never mind. Looks like Anand Jain beat me to the punch.

Louis

November 2, 2020 12:29 PM, louis@waffle.tech wrote:

> Good points. I'll move it into 'struct btrfs_fs_info', look at making it a per-cpu variable, and do
> some testing with perf (later this week hopefully).
> 
> Louis
> 
> October 23, 2020 1:42 AM, "Nikolay Borisov" <nborisov@suse.com> wrote:
> 
>> On 23.10.20 г. 10:38 ч., Wang Yugui wrote:
>> 
>>> Hi, Louis Jencka
>>> 
>>> Can we move 'atomic_t rr_counter' into 'struct btrfs_fs_info' to
>>> support multiple mounted btrfs filesystem?
>> 
>> And introduce constant cache line pings for every read. This thing needs
>> to be tested under load with perf to see what kind of overhead the
>> shared atomic_t counter adds. My hunch is this should really be a
>> per-cpu variable.
>> 
>> <snip>

Patch

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 58b9c419a..45c581d46 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -333,6 +333,9 @@  struct list_head * __attribute_const__ btrfs_get_fs_uuids(void)
 	return &fs_uuids;
 }
 
+/* Used for round-robin balancing RAID1/RAID10 reads. */
+atomic_t rr_counter = ATOMIC_INIT(0);
+
 /*
  * alloc_fs_devices - allocate struct btrfs_fs_devices
  * @fsid:		if not NULL, copy the UUID to fs_devices::fsid
@@ -5482,7 +5485,8 @@  static int find_live_mirror(struct btrfs_fs_info *fs_info,
 	else
 		num_stripes = map->num_stripes;
 
-	preferred_mirror = first + current->pid % num_stripes;
+	preferred_mirror = first +
+	    (unsigned)atomic_inc_return(&rr_counter) % num_stripes;
 
 	if (dev_replace_is_ongoing &&
 	    fs_info->dev_replace.cont_reading_from_srcdev_mode ==