Message ID: 64bb4905dc4b77e9fa22d8ba2635a36d15a33469.1610324448.git.anand.jain@oracle.com
State:      New, archived
Series:     [v3,1/4] btrfs: add read_policy latency
On 1/11/21 4:41 AM, Anand Jain wrote:
> The read policy type latency routes the read IO based on the historical
> average wait-time experienced by the read IOs through the individual
> device. This patch obtains the historical read IO stats from the kernel
> block layer and calculates its average.
>
> Example usage:
>  echo "latency" > /sys/fs/btrfs/$uuid/read_policy
>
> Signed-off-by: Anand Jain <anand.jain@oracle.com>
> ---
> v3: The block layer commit 0d02129e76ed (block: merge struct block_device
>     and struct hd_struct) has changed the first argument of
>     part_stat_read_all() in 5.11-rc1, so the compilation would fail.
>     This patch fixes it.
>     Commit log updated.
>
> v2: Use btrfs_debug_rl() instead of btrfs_info_rl().
>     It is better to keep this debug until we have tested this on at
>     least a few kinds of hardware.
>     Drop the unrelated changes.
>     Update the change log.
>
> v1: Drop part_stat_read_all; use part_stat_read instead.
>     Drop inflight.
>
>  fs/btrfs/sysfs.c   |  3 ++-
>  fs/btrfs/volumes.c | 38 ++++++++++++++++++++++++++++++++++++++
>  fs/btrfs/volumes.h |  2 ++
>  3 files changed, 42 insertions(+), 1 deletion(-)
>
> diff --git a/fs/btrfs/sysfs.c b/fs/btrfs/sysfs.c
> index 19b9fffa2c9c..96ca7bef6357 100644
> --- a/fs/btrfs/sysfs.c
> +++ b/fs/btrfs/sysfs.c
> @@ -915,7 +915,8 @@ static bool strmatch(const char *buffer, const char *string)
>  	return false;
>  }
>
> -static const char * const btrfs_read_policy_name[] = { "pid" };
> +/* Must follow the order as in enum btrfs_read_policy */
> +static const char * const btrfs_read_policy_name[] = { "pid", "latency" };
>
>  static ssize_t btrfs_read_policy_show(struct kobject *kobj,
>  				      struct kobj_attribute *a, char *buf)
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index f4037a6bd926..f7a0a83d2cd4 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -14,6 +14,7 @@
>  #include <linux/semaphore.h>
>  #include <linux/uuid.h>
>  #include <linux/list_sort.h>
> +#include <linux/part_stat.h>
>  #include "misc.h"
>  #include "ctree.h"
>  #include "extent_map.h"
> @@ -5490,6 +5491,39 @@ int btrfs_is_parity_mirror(struct btrfs_fs_info *fs_info, u64 logical, u64 len)
>  	return ret;
>  }
>
> +static int btrfs_find_best_stripe(struct btrfs_fs_info *fs_info,
> +				  struct map_lookup *map, int first,
> +				  int num_stripe)
> +{
> +	u64 est_wait = 0;
> +	int best_stripe = 0;
> +	int index;
> +
> +	for (index = first; index < first + num_stripe; index++) {
> +		u64 read_wait;
> +		u64 avg_wait = 0;
> +		unsigned long read_ios;
> +		struct btrfs_device *device = map->stripes[index].dev;
> +
> +		read_wait = part_stat_read(device->bdev, nsecs[READ]);
> +		read_ios = part_stat_read(device->bdev, ios[READ]);
> +
> +		if (read_wait && read_ios && read_wait >= read_ios)
> +			avg_wait = div_u64(read_wait, read_ios);
> +		else
> +			btrfs_debug_rl(device->fs_devices->fs_info,

You can just use fs_info here, you already have it.

I'm not in love with doing this check every time; I'd rather cache the
results somewhere. However, if we're read-only, I can't think of a
mechanism we could piggy-back on; RW, we could just do it every
transaction commit. Fix the fs_info thing and you can add

Reviewed-by: Josef Bacik <josef@toxicpanda.com>

Thanks,

Josef
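The loop Josef is reviewing is a min-scan over per-stripe average read waits. A user-space sketch of that logic (plain integers stand in for `part_stat_read()` and `div_u64()`; the struct and helper names are invented for illustration, this is not the kernel code):

```c
#include <stdint.h>

struct stripe_stats {
	uint64_t read_wait_ns; /* cumulative read wait, like nsecs[READ] */
	uint64_t read_ios;     /* completed read IOs, like ios[READ] */
};

/*
 * Model of btrfs_find_best_stripe(): return the index, within
 * [first, first + num_stripe), of the stripe with the lowest average
 * read wait.  As in the patch, a stripe with no usable stats keeps
 * avg_wait == 0 and therefore wins the "est_wait > avg_wait" test.
 */
static int find_best_stripe(const struct stripe_stats *stats, int first,
			    int num_stripe)
{
	uint64_t est_wait = 0;
	int best_stripe = first;
	int index;

	for (index = first; index < first + num_stripe; index++) {
		uint64_t avg_wait = 0;
		uint64_t read_wait = stats[index].read_wait_ns;
		uint64_t read_ios = stats[index].read_ios;

		if (read_wait && read_ios && read_wait >= read_ios)
			avg_wait = read_wait / read_ios; /* div_u64() */

		if (est_wait == 0 || est_wait > avg_wait) {
			est_wait = avg_wait;
			best_stripe = index;
		}
	}

	return best_stripe;
}
```

One consequence of the zero-stat handling, visible in the model: if every stripe has `avg_wait == 0`, `est_wait` never becomes non-zero and the scan keeps replacing `best_stripe`, so the last stripe is returned.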
On 20/1/21 3:36 am, Josef Bacik wrote:
> On 1/11/21 4:41 AM, Anand Jain wrote:
>> The read policy type latency routes the read IO based on the historical
>> average wait-time experienced by the read IOs through the individual
>> device. This patch obtains the historical read IO stats from the kernel
>> block layer and calculates its average.
[...]
> You can just use fs_info here, you already have it.
>
> I'm not in love with doing this check every time; I'd rather cache the
> results somewhere. However, if we're read-only, I can't think of a
> mechanism we could piggy-back on; RW, we could just do it every
> transaction commit. Fix the fs_info thing and you can add

Oh. I will fix it. Thanks for the review here and in the other patches.

-Anand

> Reviewed-by: Josef Bacik <josef@toxicpanda.com>
>
> Thanks,
>
> Josef
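Josef's caching idea, refreshing a per-device average periodically instead of doing the stat reads and division on every IO, could look roughly like this in user-space C. The struct, refresh period, and helper name are hypothetical, and wall-clock time stands in for the transaction-commit hook he suggests:

```c
#include <stdbool.h>
#include <stdint.h>

struct device_latency_cache {
	uint64_t avg_wait_ns;  /* cached average read wait */
	uint64_t last_refresh; /* time of the last refresh, in ns */
};

#define REFRESH_PERIOD_NS (30ULL * 1000 * 1000 * 1000) /* e.g. 30 s */

/*
 * Recompute the cached average only when the refresh period has
 * elapsed; the per-read hot path then just reads cache->avg_wait_ns.
 * Returns true when the cache was actually refreshed.
 */
static bool maybe_refresh(struct device_latency_cache *cache,
			  uint64_t now_ns, uint64_t read_wait_ns,
			  uint64_t read_ios)
{
	if (now_ns - cache->last_refresh < REFRESH_PERIOD_NS)
		return false;

	cache->last_refresh = now_ns;
	cache->avg_wait_ns = read_ios ? read_wait_ns / read_ios : 0;
	return true;
}
```

In the RW case Josef describes, the refresh would instead be driven by transaction commit rather than a timer; the cache structure stays the same.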
On Mon, Jan 11, 2021 at 05:41:34PM +0800, Anand Jain wrote:
> The read policy type latency routes the read IO based on the historical
> average wait-time experienced by the read IOs through the individual
> device. This patch obtains the historical read IO stats from the kernel
> block layer and calculates its average.
>
> Example usage:
>  echo "latency" > /sys/fs/btrfs/$uuid/read_policy
>
> Signed-off-by: Anand Jain <anand.jain@oracle.com>
[...]

Hi Anand,

I tested this policy with fio and dstat. It performs really well
overall. On my raid1c3 array with two HDDs and one SSD (which is the
last device), I'm getting the following results.

With direct=0:

Run status group 0 (all jobs):
   READ: bw=3560MiB/s (3733MB/s), 445MiB/s-445MiB/s (467MB/s-467MB/s),
io=3129GiB (3360GB), run=900003-900013msec

With direct=1:

Run status group 0 (all jobs):
   READ: bw=520MiB/s (545MB/s), 64.9MiB/s-65.0MiB/s (68.1MB/s-68.2MB/s),
io=457GiB (490GB), run=900001-900001msec

However, I was also running dstat at the same time and I noticed that
the reads stop sometimes for ~15-20 seconds. For example:

----system---- --dsk/sdb-- --dsk/sdc-- --dsk/sdd--
20-01 00:37:21|   0     0 :   0     0 : 509M    0
20-01 00:37:22|   0     0 :   0     0 : 517M    0
20-01 00:37:23|   0     0 :   0     0 : 507M    0
20-01 00:37:24|   0     0 :   0     0 : 518M    0
20-01 00:37:25|   0     0 :   0     0 :  22M    0
20-01 00:37:26|   0     0 :   0     0 :   0     0
20-01 00:37:27|   0     0 :   0     0 :   0     0
20-01 00:37:28|   0     0 :   0     0 :   0     0
20-01 00:37:29|   0     0 :   0     0 :   0     0
20-01 00:37:30|   0     0 :   0     0 :   0     0
20-01 00:37:31|   0     0 :   0     0 :   0     0
20-01 00:37:32|   0     0 :   0     0 :   0     0
20-01 00:37:33|   0     0 :   0     0 :   0     0
20-01 00:37:34|   0     0 :   0     0 :   0     0
20-01 00:37:35|   0     0 :   0     0 :   0     0
20-01 00:37:36|   0     0 :   0     0 :   0     0
20-01 00:37:37|   0     0 :   0     0 :   0     0
20-01 00:37:38|   0     0 :   0     0 :   0     0
20-01 00:37:39|   0     0 :   0     0 :   0     0
20-01 00:37:40|   0     0 :   0     0 :   0     0
20-01 00:37:41|   0     0 :   0     0 :   0     0
20-01 00:37:42|   0     0 :   0     0 :   0     0
20-01 00:37:43|   0     0 :   0     0 :   0     0
20-01 00:37:44|   0     0 :   0     0 :   0     0
20-01 00:37:45|   0     0 :   0     0 :   0     0
20-01 00:37:46|   0     0 :   0     0 :  55M    0
20-01 00:37:47|   0     0 :   0     0 : 516M    0
20-01 00:37:48|   0     0 :   0     0 : 515M    0
20-01 00:37:49|   0     0 :   0     0 : 516M    0
20-01 00:37:50|   0     0 :   0     0 : 520M    0
20-01 00:37:51|   0     0 :   0     0 : 520M    0
20-01 00:37:52|   0     0 :   0     0 : 514M    0

Here is the full log:

https://susepaste.org/16928336

I never noticed that happening with the PID policy. Is that maybe
because of reading the part stats for all CPUs while selecting the
mirror?

Michal
> Hi Anand,
>
> I tested this policy with fio and dstat. It performs really well
> overall. On my raid1c3 array with two HDDs and one SSD (which is the
> last device), I'm getting the following results.

Michal,

Thank you for verifying. More below...

> With direct=0:
>
> Run status group 0 (all jobs):
>    READ: bw=3560MiB/s (3733MB/s), 445MiB/s-445MiB/s (467MB/s-467MB/s),
> io=3129GiB (3360GB), run=900003-900013msec
>
> With direct=1:
>
> Run status group 0 (all jobs):
>    READ: bw=520MiB/s (545MB/s), 64.9MiB/s-65.0MiB/s (68.1MB/s-68.2MB/s),
> io=457GiB (490GB), run=900001-900001msec
>
> However, I was also running dstat at the same time and I noticed that
> the reads stop sometimes for ~15-20 seconds.
[...]
> Here is the full log:
>
> https://susepaste.org/16928336
>
> I never noticed that happening with the PID policy. Is that maybe
> because of reading the part stats for all CPUs while selecting the
> mirror?

I ran the fio tests again, now with dstat in another window. I don't
notice any such stalls; the read numbers were continuous until fio
finished. Could you please check with the fio command below, and could
you also share your fio command options?

fio \
--filename=/btrfs/largefile \
--directory=/btrfs \
--filesize=50G \
--size=50G \
--bs=64k \
--ioengine=libaio \
--rw=read \
--direct=1 \
--numjobs=1 \
--group_reporting \
--thread \
--name iops-test-job

Is it system specific?

Thanks, Anand

> Michal
On Wed, Jan 20, 2021 at 08:30:56PM +0800, Anand Jain wrote:
> I ran the fio tests again, now with dstat in another window. I don't
> notice any such stalls; the read numbers were continuous until fio
> finished. Could you please check with the fio command below, and could
> you also share your fio command options?

That's the fio config I used:

https://gitlab.com/vadorovsky/playground/-/blob/master/fio/btrfs-raid1-seqread.fio

The main differences seem to be:
- the number of jobs (I used the number of CPU threads)
- direct vs buffered I/O

> fio \
> --filename=/btrfs/largefile \
[...]
> --name iops-test-job
>
> Is it system specific?

With this command, dstat output looks good:

https://paste.opensuse.org/view/simple/93159623

So I think it might be specific to whether we test direct or buffered
I/O, or to the number of jobs (single vs multiple). Since most of the
I/O in production environments is usually buffered, I think we should
test with direct=0 too.

Cheers,
Michal
On 20/1/21 9:54 pm, Michal Rostecki wrote:
> That's the fio config I used:
>
> https://gitlab.com/vadorovsky/playground/-/blob/master/fio/btrfs-raid1-seqread.fio
>
> The main differences seem to be:
> - the number of jobs (I used the number of CPU threads)
> - direct vs buffered I/O
>
> With this command, dstat output looks good:
>
> https://paste.opensuse.org/view/simple/93159623
>
> So I think it might be specific to whether we test direct or buffered
> I/O, or to the number of jobs (single vs multiple). Since most of the
> I/O in production environments is usually buffered, I think we should
> test with direct=0 too.

That explains the stall for now: it might be reading from the page
cache, so there was actually no IO to the device. Agreed, it must be
tested with the most common type of IO; but as the cache comes into
play there, I was a bit reluctant.

Thanks, Anand

> Cheers,
> Michal
diff --git a/fs/btrfs/sysfs.c b/fs/btrfs/sysfs.c
index 19b9fffa2c9c..96ca7bef6357 100644
--- a/fs/btrfs/sysfs.c
+++ b/fs/btrfs/sysfs.c
@@ -915,7 +915,8 @@ static bool strmatch(const char *buffer, const char *string)
 	return false;
 }
 
-static const char * const btrfs_read_policy_name[] = { "pid" };
+/* Must follow the order as in enum btrfs_read_policy */
+static const char * const btrfs_read_policy_name[] = { "pid", "latency" };
 
 static ssize_t btrfs_read_policy_show(struct kobject *kobj,
 				      struct kobj_attribute *a, char *buf)
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index f4037a6bd926..f7a0a83d2cd4 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -14,6 +14,7 @@
 #include <linux/semaphore.h>
 #include <linux/uuid.h>
 #include <linux/list_sort.h>
+#include <linux/part_stat.h>
 #include "misc.h"
 #include "ctree.h"
 #include "extent_map.h"
@@ -5490,6 +5491,39 @@ int btrfs_is_parity_mirror(struct btrfs_fs_info *fs_info, u64 logical, u64 len)
 	return ret;
 }
 
+static int btrfs_find_best_stripe(struct btrfs_fs_info *fs_info,
+				  struct map_lookup *map, int first,
+				  int num_stripe)
+{
+	u64 est_wait = 0;
+	int best_stripe = 0;
+	int index;
+
+	for (index = first; index < first + num_stripe; index++) {
+		u64 read_wait;
+		u64 avg_wait = 0;
+		unsigned long read_ios;
+		struct btrfs_device *device = map->stripes[index].dev;
+
+		read_wait = part_stat_read(device->bdev, nsecs[READ]);
+		read_ios = part_stat_read(device->bdev, ios[READ]);
+
+		if (read_wait && read_ios && read_wait >= read_ios)
+			avg_wait = div_u64(read_wait, read_ios);
+		else
+			btrfs_debug_rl(device->fs_devices->fs_info,
+	"devid: %llu avg_wait ZERO read_wait %llu read_ios %lu",
+				       device->devid, read_wait, read_ios);
+
+		if (est_wait == 0 || est_wait > avg_wait) {
+			est_wait = avg_wait;
+			best_stripe = index;
+		}
+	}
+
+	return best_stripe;
+}
+
 static int find_live_mirror(struct btrfs_fs_info *fs_info,
 			    struct map_lookup *map, int first,
 			    int dev_replace_is_ongoing)
@@ -5519,6 +5553,10 @@ static int find_live_mirror(struct btrfs_fs_info *fs_info,
 	case BTRFS_READ_POLICY_PID:
 		preferred_mirror = first + (current->pid % num_stripes);
 		break;
+	case BTRFS_READ_POLICY_LATENCY:
+		preferred_mirror = btrfs_find_best_stripe(fs_info, map, first,
+							  num_stripes);
+		break;
 	}
 
 	if (dev_replace_is_ongoing &&
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index 1997a4649a66..71ba1f0e93f4 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -222,6 +222,8 @@ enum btrfs_chunk_allocation_policy {
 enum btrfs_read_policy {
 	/* Use process PID to choose the stripe */
 	BTRFS_READ_POLICY_PID,
+	/* Find and use device with the lowest latency */
+	BTRFS_READ_POLICY_LATENCY,
 	BTRFS_NR_READ_POLICY,
 };
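For experimenting with the same idea outside the kernel: the per-device read stats the patch reads with `part_stat_read()` are exposed in `/sys/block/<dev>/stat`, where field 1 is completed read IOs and field 4 is the cumulative read wait in milliseconds. A user-space sketch of the averaging over one such line (the helper name is made up for illustration):

```c
#include <inttypes.h>
#include <stdio.h>

/*
 * Parse one line in the /sys/block/<dev>/stat format and return the
 * average read wait in milliseconds, or 0 when there is no usable
 * data (mirroring the patch's read_wait/read_ios guard).
 */
static uint64_t avg_read_wait_ms(const char *stat_line)
{
	uint64_t read_ios = 0;    /* field 1: completed read IOs */
	uint64_t read_merges;     /* field 2 */
	uint64_t read_sectors;    /* field 3 */
	uint64_t read_ticks = 0;  /* field 4: read wait, in ms */

	if (sscanf(stat_line, "%" SCNu64 " %" SCNu64 " %" SCNu64 " %" SCNu64,
		   &read_ios, &read_merges, &read_sectors, &read_ticks) != 4)
		return 0;

	if (read_ticks && read_ios && read_ticks >= read_ios)
		return read_ticks / read_ios;
	return 0;
}
```

Note the unit difference: the sysfs file reports milliseconds, while the in-kernel `nsecs[READ]` counter the patch uses is in nanoseconds.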
The read policy type latency routes the read IO based on the historical
average wait-time experienced by the read IOs through the individual
device. This patch obtains the historical read IO stats from the kernel
block layer and calculates its average.

Example usage:
 echo "latency" > /sys/fs/btrfs/$uuid/read_policy

Signed-off-by: Anand Jain <anand.jain@oracle.com>
---
v3: The block layer commit 0d02129e76ed (block: merge struct block_device
    and struct hd_struct) has changed the first argument of
    part_stat_read_all() in 5.11-rc1, so the compilation would fail.
    This patch fixes it.
    Commit log updated.

v2: Use btrfs_debug_rl() instead of btrfs_info_rl().
    It is better to keep this debug until we have tested this on at
    least a few kinds of hardware.
    Drop the unrelated changes.
    Update the change log.

v1: Drop part_stat_read_all; use part_stat_read instead.
    Drop inflight.

 fs/btrfs/sysfs.c   |  3 ++-
 fs/btrfs/volumes.c | 38 ++++++++++++++++++++++++++++++++++++++
 fs/btrfs/volumes.h |  2 ++
 3 files changed, 42 insertions(+), 1 deletion(-)
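On the sysfs side, the string written to `read_policy` is mapped to an enum value by its position in `btrfs_read_policy_name[]` (hence the "must follow the order" comment in the patch). The lookup can be sketched in user-space C; the names here mirror, rather than copy, the kernel helpers:

```c
#include <string.h>

enum read_policy {
	READ_POLICY_PID,     /* use process PID to choose the stripe */
	READ_POLICY_LATENCY, /* use the device with the lowest latency */
	NR_READ_POLICY,
};

/* Must follow the order as in enum read_policy */
static const char * const read_policy_name[] = { "pid", "latency" };

/*
 * Map a string written to .../read_policy to its enum value; -1 stands
 * in for the kernel's -EINVAL on an unknown policy name.
 */
static int parse_read_policy(const char *buf)
{
	int i;

	for (i = 0; i < NR_READ_POLICY; i++) {
		if (strcmp(buf, read_policy_name[i]) == 0)
			return i;
	}
	return -1;
}
```

Keeping the name table positionally in sync with the enum is what lets both the show and store paths share a single array.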