Message ID: 64bb4905dc4b77e9fa22d8ba2635a36d15a33469.1610324448.git.anand.jain@oracle.com
State:      New, archived
Series:     [v3,1/4] btrfs: add read_policy latency
On 1/11/21 4:41 AM, Anand Jain wrote:
> The read policy type latency routes the read IO based on the historical
> average wait-time experienced by the read IOs through the individual
> device. This patch obtains the historical read IO stats from the kernel
> block layer and calculates its average.
>
> Example usage:
>  echo "latency" > /sys/fs/btrfs/$uuid/read_policy
>
> Signed-off-by: Anand Jain <anand.jain@oracle.com>
> ---
> v3: The block layer commit 0d02129e76ed (block: merge struct block_device
>     and struct hd_struct) has changed the first argument of
>     part_stat_read_all() in 5.11-rc1, so the compilation would fail.
>     This patch fixes it.
>     Commit log updated.
>
> v2: Use btrfs_debug_rl() instead of btrfs_info_rl().
>     It is better to keep this debug until we have tested this on at
>     least a few kinds of hardware.
>     Drop the unrelated changes.
>     Update the change log.
>
> v1: Drop part_stat_read_all; use part_stat_read instead.
>     Drop inflight.
>
>  fs/btrfs/sysfs.c   |  3 ++-
>  fs/btrfs/volumes.c | 38 ++++++++++++++++++++++++++++++++++++++
>  fs/btrfs/volumes.h |  2 ++
>  3 files changed, 42 insertions(+), 1 deletion(-)
>
> diff --git a/fs/btrfs/sysfs.c b/fs/btrfs/sysfs.c
> index 19b9fffa2c9c..96ca7bef6357 100644
> --- a/fs/btrfs/sysfs.c
> +++ b/fs/btrfs/sysfs.c
> @@ -915,7 +915,8 @@ static bool strmatch(const char *buffer, const char *string)
>  	return false;
>  }
>
> -static const char * const btrfs_read_policy_name[] = { "pid" };
> +/* Must follow the order as in enum btrfs_read_policy */
> +static const char * const btrfs_read_policy_name[] = { "pid", "latency" };
>
>  static ssize_t btrfs_read_policy_show(struct kobject *kobj,
>  				      struct kobj_attribute *a, char *buf)
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index f4037a6bd926..f7a0a83d2cd4 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -14,6 +14,7 @@
>  #include <linux/semaphore.h>
>  #include <linux/uuid.h>
>  #include <linux/list_sort.h>
> +#include <linux/part_stat.h>
>  #include "misc.h"
>  #include "ctree.h"
>  #include "extent_map.h"
> @@ -5490,6 +5491,39 @@ int btrfs_is_parity_mirror(struct btrfs_fs_info *fs_info, u64 logical, u64 len)
>  	return ret;
>  }
>
> +static int btrfs_find_best_stripe(struct btrfs_fs_info *fs_info,
> +				  struct map_lookup *map, int first,
> +				  int num_stripe)
> +{
> +	u64 est_wait = 0;
> +	int best_stripe = 0;
> +	int index;
> +
> +	for (index = first; index < first + num_stripe; index++) {
> +		u64 read_wait;
> +		u64 avg_wait = 0;
> +		unsigned long read_ios;
> +		struct btrfs_device *device = map->stripes[index].dev;
> +
> +		read_wait = part_stat_read(device->bdev, nsecs[READ]);
> +		read_ios = part_stat_read(device->bdev, ios[READ]);
> +
> +		if (read_wait && read_ios && read_wait >= read_ios)
> +			avg_wait = div_u64(read_wait, read_ios);
> +		else
> +			btrfs_debug_rl(device->fs_devices->fs_info,

You can just use fs_info here, you already have it.

I'm not in love with doing this check every time; I'd rather cache the
results somewhere. However, if we're read-only, I can't think of a
mechanism we could piggy-back on; RW, we could just do it every
transaction commit. Fix the fs_info thing and you can add

Reviewed-by: Josef Bacik <josef@toxicpanda.com>

Thanks,

Josef
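The loop Josef is reviewing is a min-scan over per-stripe average read waits. A user-space sketch of that logic (plain integers stand in for `part_stat_read()` and `div_u64()`; the struct and helper names are invented for illustration, this is not the kernel code):

```c
#include <stdint.h>

struct stripe_stats {
	uint64_t read_wait_ns; /* cumulative read wait, like nsecs[READ] */
	uint64_t read_ios;     /* completed read IOs, like ios[READ] */
};

/*
 * Model of btrfs_find_best_stripe(): return the index, within
 * [first, first + num_stripe), of the stripe with the lowest average
 * read wait.  As in the patch, a stripe with no usable stats keeps
 * avg_wait == 0 and therefore wins the "est_wait > avg_wait" test.
 */
static int find_best_stripe(const struct stripe_stats *stats, int first,
			    int num_stripe)
{
	uint64_t est_wait = 0;
	int best_stripe = first;
	int index;

	for (index = first; index < first + num_stripe; index++) {
		uint64_t avg_wait = 0;
		uint64_t read_wait = stats[index].read_wait_ns;
		uint64_t read_ios = stats[index].read_ios;

		if (read_wait && read_ios && read_wait >= read_ios)
			avg_wait = read_wait / read_ios; /* div_u64() */

		if (est_wait == 0 || est_wait > avg_wait) {
			est_wait = avg_wait;
			best_stripe = index;
		}
	}

	return best_stripe;
}
```

One consequence of the zero-stat handling, visible in the model: if every stripe has `avg_wait == 0`, `est_wait` never becomes non-zero and the scan keeps replacing `best_stripe`, so the last stripe is returned.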
On 20/1/21 3:36 am, Josef Bacik wrote:
> On 1/11/21 4:41 AM, Anand Jain wrote:
>> The read policy type latency routes the read IO based on the historical
>> average wait-time experienced by the read IOs through the individual
>> device. This patch obtains the historical read IO stats from the kernel
>> block layer and calculates its average.
[...]
> You can just use fs_info here, you already have it.
>
> I'm not in love with doing this check every time; I'd rather cache the
> results somewhere. However, if we're read-only, I can't think of a
> mechanism we could piggy-back on; RW, we could just do it every
> transaction commit. Fix the fs_info thing and you can add

Oh. I will fix it. Thanks for the review here and in the other patches.

-Anand

> Reviewed-by: Josef Bacik <josef@toxicpanda.com>
>
> Thanks,
>
> Josef
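Josef's caching idea, refreshing a per-device average periodically instead of doing the stat reads and division on every IO, could look roughly like this in user-space C. The struct, refresh period, and helper name are hypothetical, and wall-clock time stands in for the transaction-commit hook he suggests:

```c
#include <stdbool.h>
#include <stdint.h>

struct device_latency_cache {
	uint64_t avg_wait_ns;  /* cached average read wait */
	uint64_t last_refresh; /* time of the last refresh, in ns */
};

#define REFRESH_PERIOD_NS (30ULL * 1000 * 1000 * 1000) /* e.g. 30 s */

/*
 * Recompute the cached average only when the refresh period has
 * elapsed; the per-read hot path then just reads cache->avg_wait_ns.
 * Returns true when the cache was actually refreshed.
 */
static bool maybe_refresh(struct device_latency_cache *cache,
			  uint64_t now_ns, uint64_t read_wait_ns,
			  uint64_t read_ios)
{
	if (now_ns - cache->last_refresh < REFRESH_PERIOD_NS)
		return false;

	cache->last_refresh = now_ns;
	cache->avg_wait_ns = read_ios ? read_wait_ns / read_ios : 0;
	return true;
}
```

In the RW case Josef describes, the refresh would instead be driven by transaction commit rather than a timer; the cache structure stays the same.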
On Mon, Jan 11, 2021 at 05:41:34PM +0800, Anand Jain wrote:
> The read policy type latency routes the read IO based on the historical
> average wait-time experienced by the read IOs through the individual
> device. This patch obtains the historical read IO stats from the kernel
> block layer and calculates its average.
>
> Example usage:
>  echo "latency" > /sys/fs/btrfs/$uuid/read_policy
>
> Signed-off-by: Anand Jain <anand.jain@oracle.com>
[...]

Hi Anand,

I tested this policy with fio and dstat. It performs really well
overall. On my raid1c3 array with two HDDs and one SSD (which is the
last device), I'm getting the following results.

With direct=0:

Run status group 0 (all jobs):
   READ: bw=3560MiB/s (3733MB/s), 445MiB/s-445MiB/s (467MB/s-467MB/s),
io=3129GiB (3360GB), run=900003-900013msec

With direct=1:

Run status group 0 (all jobs):
   READ: bw=520MiB/s (545MB/s), 64.9MiB/s-65.0MiB/s (68.1MB/s-68.2MB/s),
io=457GiB (490GB), run=900001-900001msec

However, I was also running dstat at the same time and I noticed that
the reads stop sometimes for ~15-20 seconds. For example:

----system---- --dsk/sdb-- --dsk/sdc-- --dsk/sdd--
20-01 00:37:21|   0     0 :   0     0 : 509M    0
20-01 00:37:22|   0     0 :   0     0 : 517M    0
20-01 00:37:23|   0     0 :   0     0 : 507M    0
20-01 00:37:24|   0     0 :   0     0 : 518M    0
20-01 00:37:25|   0     0 :   0     0 :  22M    0
20-01 00:37:26|   0     0 :   0     0 :   0     0
20-01 00:37:27|   0     0 :   0     0 :   0     0
20-01 00:37:28|   0     0 :   0     0 :   0     0
20-01 00:37:29|   0     0 :   0     0 :   0     0
20-01 00:37:30|   0     0 :   0     0 :   0     0
20-01 00:37:31|   0     0 :   0     0 :   0     0
20-01 00:37:32|   0     0 :   0     0 :   0     0
20-01 00:37:33|   0     0 :   0     0 :   0     0
20-01 00:37:34|   0     0 :   0     0 :   0     0
20-01 00:37:35|   0     0 :   0     0 :   0     0
20-01 00:37:36|   0     0 :   0     0 :   0     0
20-01 00:37:37|   0     0 :   0     0 :   0     0
20-01 00:37:38|   0     0 :   0     0 :   0     0
20-01 00:37:39|   0     0 :   0     0 :   0     0
20-01 00:37:40|   0     0 :   0     0 :   0     0
20-01 00:37:41|   0     0 :   0     0 :   0     0
20-01 00:37:42|   0     0 :   0     0 :   0     0
20-01 00:37:43|   0     0 :   0     0 :   0     0
20-01 00:37:44|   0     0 :   0     0 :   0     0
20-01 00:37:45|   0     0 :   0     0 :   0     0
20-01 00:37:46|   0     0 :   0     0 :  55M    0
20-01 00:37:47|   0     0 :   0     0 : 516M    0
20-01 00:37:48|   0     0 :   0     0 : 515M    0
20-01 00:37:49|   0     0 :   0     0 : 516M    0
20-01 00:37:50|   0     0 :   0     0 : 520M    0
20-01 00:37:51|   0     0 :   0     0 : 520M    0
20-01 00:37:52|   0     0 :   0     0 : 514M    0

Here is the full log:

https://susepaste.org/16928336

I never noticed that happening with the PID policy. Is that maybe
because of reading the part stats for all CPUs while selecting the
mirror?

Michal
> Hi Anand,
>
> I tested this policy with fio and dstat. It performs really well
> overall. On my raid1c3 array with two HDDs and one SSD (which is the
> last device), I'm getting the following results.

Michal,

Thank you for verifying. More below...

> With direct=0:
>
> Run status group 0 (all jobs):
>    READ: bw=3560MiB/s (3733MB/s), 445MiB/s-445MiB/s (467MB/s-467MB/s),
> io=3129GiB (3360GB), run=900003-900013msec
>
> With direct=1:
>
> Run status group 0 (all jobs):
>    READ: bw=520MiB/s (545MB/s), 64.9MiB/s-65.0MiB/s (68.1MB/s-68.2MB/s),
> io=457GiB (490GB), run=900001-900001msec
>
> However, I was also running dstat at the same time and I noticed that
> the reads stop sometimes for ~15-20 seconds.
[...]
> Here is the full log:
>
> https://susepaste.org/16928336
>
> I never noticed that happening with the PID policy. Is that maybe
> because of reading the part stats for all CPUs while selecting the
> mirror?

I ran the fio tests again, now with dstat in another window. I don't
notice any such stalls; the read numbers were continuous until fio
finished. Could you please check with the fio command below, and could
you also share your fio command options?

fio \
--filename=/btrfs/largefile \
--directory=/btrfs \
--filesize=50G \
--size=50G \
--bs=64k \
--ioengine=libaio \
--rw=read \
--direct=1 \
--numjobs=1 \
--group_reporting \
--thread \
--name iops-test-job

Is it system specific?

Thanks, Anand

> Michal
On Wed, Jan 20, 2021 at 08:30:56PM +0800, Anand Jain wrote:
> I ran the fio tests again, now with dstat in another window. I don't
> notice any such stalls; the read numbers were continuous until fio
> finished. Could you please check with the fio command below, and could
> you also share your fio command options?

That's the fio config I used:

https://gitlab.com/vadorovsky/playground/-/blob/master/fio/btrfs-raid1-seqread.fio

The main differences seem to be:
- the number of jobs (I used the number of CPU threads)
- direct vs buffered I/O

> fio \
> --filename=/btrfs/largefile \
[...]
> --name iops-test-job
>
> Is it system specific?

With this command, dstat output looks good:

https://paste.opensuse.org/view/simple/93159623

So I think it might be specific to whether we test direct or buffered
I/O, or to the number of jobs (single vs multiple). Since most of the
I/O in production environments is usually buffered, I think we should
test with direct=0 too.

Cheers,
Michal
On 20/1/21 9:54 pm, Michal Rostecki wrote:
> That's the fio config I used:
>
> https://gitlab.com/vadorovsky/playground/-/blob/master/fio/btrfs-raid1-seqread.fio
>
> The main differences seem to be:
> - the number of jobs (I used the number of CPU threads)
> - direct vs buffered I/O
>
> With this command, dstat output looks good:
>
> https://paste.opensuse.org/view/simple/93159623
>
> So I think it might be specific to whether we test direct or buffered
> I/O, or to the number of jobs (single vs multiple). Since most of the
> I/O in production environments is usually buffered, I think we should
> test with direct=0 too.

That explains the stall for now: it might be reading from the page
cache, so there was actually no IO to the device. Agreed, it must be
tested with the most common type of IO; but as the cache comes into
play there, I was a bit reluctant.

Thanks, Anand

> Cheers,
> Michal
diff --git a/fs/btrfs/sysfs.c b/fs/btrfs/sysfs.c
index 19b9fffa2c9c..96ca7bef6357 100644
--- a/fs/btrfs/sysfs.c
+++ b/fs/btrfs/sysfs.c
@@ -915,7 +915,8 @@ static bool strmatch(const char *buffer, const char *string)
 	return false;
 }
 
-static const char * const btrfs_read_policy_name[] = { "pid" };
+/* Must follow the order as in enum btrfs_read_policy */
+static const char * const btrfs_read_policy_name[] = { "pid", "latency" };
 
 static ssize_t btrfs_read_policy_show(struct kobject *kobj,
 				      struct kobj_attribute *a, char *buf)
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index f4037a6bd926..f7a0a83d2cd4 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -14,6 +14,7 @@
 #include <linux/semaphore.h>
 #include <linux/uuid.h>
 #include <linux/list_sort.h>
+#include <linux/part_stat.h>
 #include "misc.h"
 #include "ctree.h"
 #include "extent_map.h"
@@ -5490,6 +5491,39 @@ int btrfs_is_parity_mirror(struct btrfs_fs_info *fs_info, u64 logical, u64 len)
 	return ret;
 }
 
+static int btrfs_find_best_stripe(struct btrfs_fs_info *fs_info,
+				  struct map_lookup *map, int first,
+				  int num_stripe)
+{
+	u64 est_wait = 0;
+	int best_stripe = 0;
+	int index;
+
+	for (index = first; index < first + num_stripe; index++) {
+		u64 read_wait;
+		u64 avg_wait = 0;
+		unsigned long read_ios;
+		struct btrfs_device *device = map->stripes[index].dev;
+
+		read_wait = part_stat_read(device->bdev, nsecs[READ]);
+		read_ios = part_stat_read(device->bdev, ios[READ]);
+
+		if (read_wait && read_ios && read_wait >= read_ios)
+			avg_wait = div_u64(read_wait, read_ios);
+		else
+			btrfs_debug_rl(device->fs_devices->fs_info,
+	"devid: %llu avg_wait ZERO read_wait %llu read_ios %lu",
+				       device->devid, read_wait, read_ios);
+
+		if (est_wait == 0 || est_wait > avg_wait) {
+			est_wait = avg_wait;
+			best_stripe = index;
+		}
+	}
+
+	return best_stripe;
+}
+
 static int find_live_mirror(struct btrfs_fs_info *fs_info,
 			    struct map_lookup *map, int first,
 			    int dev_replace_is_ongoing)
@@ -5519,6 +5553,10 @@ static int find_live_mirror(struct btrfs_fs_info *fs_info,
 	case BTRFS_READ_POLICY_PID:
 		preferred_mirror = first + (current->pid % num_stripes);
 		break;
+	case BTRFS_READ_POLICY_LATENCY:
+		preferred_mirror = btrfs_find_best_stripe(fs_info, map, first,
+							  num_stripes);
+		break;
 	}
 
 	if (dev_replace_is_ongoing &&
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index 1997a4649a66..71ba1f0e93f4 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -222,6 +222,8 @@ enum btrfs_chunk_allocation_policy {
 enum btrfs_read_policy {
 	/* Use process PID to choose the stripe */
 	BTRFS_READ_POLICY_PID,
+	/* Find and use device with the lowest latency */
+	BTRFS_READ_POLICY_LATENCY,
 	BTRFS_NR_READ_POLICY,
 };
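For experimenting with the same idea outside the kernel: the per-device read stats the patch reads with `part_stat_read()` are exposed in `/sys/block/<dev>/stat`, where field 1 is completed read IOs and field 4 is the cumulative read wait in milliseconds. A user-space sketch of the averaging over one such line (the helper name is made up for illustration):

```c
#include <inttypes.h>
#include <stdio.h>

/*
 * Parse one line in the /sys/block/<dev>/stat format and return the
 * average read wait in milliseconds, or 0 when there is no usable
 * data (mirroring the patch's read_wait/read_ios guard).
 */
static uint64_t avg_read_wait_ms(const char *stat_line)
{
	uint64_t read_ios = 0;    /* field 1: completed read IOs */
	uint64_t read_merges;     /* field 2 */
	uint64_t read_sectors;    /* field 3 */
	uint64_t read_ticks = 0;  /* field 4: read wait, in ms */

	if (sscanf(stat_line, "%" SCNu64 " %" SCNu64 " %" SCNu64 " %" SCNu64,
		   &read_ios, &read_merges, &read_sectors, &read_ticks) != 4)
		return 0;

	if (read_ticks && read_ios && read_ticks >= read_ios)
		return read_ticks / read_ios;
	return 0;
}
```

Note the unit difference: the sysfs file reports milliseconds, while the in-kernel `nsecs[READ]` counter the patch uses is in nanoseconds.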
The read policy type latency routes the read IO based on the historical
average wait-time experienced by the read IOs through the individual
device. This patch obtains the historical read IO stats from the kernel
block layer and calculates its average.

Example usage:
 echo "latency" > /sys/fs/btrfs/$uuid/read_policy

Signed-off-by: Anand Jain <anand.jain@oracle.com>
---
v3: The block layer commit 0d02129e76ed (block: merge struct block_device
    and struct hd_struct) has changed the first argument of
    part_stat_read_all() in 5.11-rc1, so the compilation would fail.
    This patch fixes it.
    Commit log updated.

v2: Use btrfs_debug_rl() instead of btrfs_info_rl().
    It is better to keep this debug until we have tested this on at
    least a few kinds of hardware.
    Drop the unrelated changes.
    Update the change log.

v1: Drop part_stat_read_all; use part_stat_read instead.
    Drop inflight.

 fs/btrfs/sysfs.c   |  3 ++-
 fs/btrfs/volumes.c | 38 ++++++++++++++++++++++++++++++++++++++
 fs/btrfs/volumes.h |  2 ++
 3 files changed, 42 insertions(+), 1 deletion(-)
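On the sysfs side, the string written to `read_policy` is mapped to an enum value by its position in `btrfs_read_policy_name[]` (hence the "must follow the order" comment in the patch). The lookup can be sketched in user-space C; the names here mirror, rather than copy, the kernel helpers:

```c
#include <string.h>

enum read_policy {
	READ_POLICY_PID,     /* use process PID to choose the stripe */
	READ_POLICY_LATENCY, /* use the device with the lowest latency */
	NR_READ_POLICY,
};

/* Must follow the order as in enum read_policy */
static const char * const read_policy_name[] = { "pid", "latency" };

/*
 * Map a string written to .../read_policy to its enum value; -1 stands
 * in for the kernel's -EINVAL on an unknown policy name.
 */
static int parse_read_policy(const char *buf)
{
	int i;

	for (i = 0; i < NR_READ_POLICY; i++) {
		if (strcmp(buf, read_policy_name[i]) == 0)
			return i;
	}
	return -1;
}
```

Keeping the name table positionally in sync with the enum is what lets both the show and store paths share a single array.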