Message ID: 63f6f00e2ecc741efd2200c3c87b5db52c6be2fd.1611114341.git.anand.jain@oracle.com
State:      New, archived
Series:     btrfs: read_policy types latency, device and round-robin
On Tue, Jan 19, 2021 at 11:52:05PM -0800, Anand Jain wrote:
> The read policy type latency routes the read IO based on the historical
> average wait-time experienced by the read IOs through the individual
> device. This patch obtains the historical read IO stats from the kernel
> block layer and calculates its average.

This does not say how the stripe is selected using the gathered numbers.
Ie. what is the criteria like minimum average time, "based on" is too
vague.

> Example usage:
> echo "latency" > /sys/fs/btrfs/$uuid/read_policy

Do you have some sample results? I remember you posted something but it
would be good to have that in the changelog too.

> Signed-off-by: Anand Jain <anand.jain@oracle.com>
> Reviewed-by: Josef Bacik <josef@toxicpanda.com>
> ---
> v4: For btrfs_debug_rl() use fs_info instead of
>     device->fs_devices->fs_info.
>
> v3: The block layer commit 0d02129e76ed (block: merge struct block_device and
>     struct hd_struct) has changed the first argument in the function
>     part_stat_read_all() in 5.11-rc1. So the compilation will fail. This patch
>     fixes it.
>     Commit log updated.
>
> v2: Use btrfs_debug_rl() instead of btrfs_info_rl()
>     It is better we have this debug until we test this on at least few
>     hardwares.
>     Drop the unrelated changes.
>     Update change log.
>
> rfc->v1: Drop part_stat_read_all instead use part_stat_read
>     Drop inflight
>
>  fs/btrfs/sysfs.c   |  3 ++-
>  fs/btrfs/volumes.c | 38 ++++++++++++++++++++++++++++++++++++++
>  fs/btrfs/volumes.h |  2 ++
>  3 files changed, 42 insertions(+), 1 deletion(-)
>
> diff --git a/fs/btrfs/sysfs.c b/fs/btrfs/sysfs.c
> index 4522a1c4cd08..7c0324fe97b2 100644
> --- a/fs/btrfs/sysfs.c
> +++ b/fs/btrfs/sysfs.c
> @@ -915,7 +915,8 @@ static bool strmatch(const char *buffer, const char *string)
>  	return false;
>  }
>
> -static const char * const btrfs_read_policy_name[] = { "pid" };
> +/* Must follow the order as in enum btrfs_read_policy */
> +static const char * const btrfs_read_policy_name[] = { "pid", "latency" };
>
>  static ssize_t btrfs_read_policy_show(struct kobject *kobj,
>  				      struct kobj_attribute *a, char *buf)
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index 62d6a890fc50..f361f1c87eb6 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -14,6 +14,7 @@
>  #include <linux/semaphore.h>
>  #include <linux/uuid.h>
>  #include <linux/list_sort.h>
> +#include <linux/part_stat.h>
>  #include "misc.h"
>  #include "ctree.h"
>  #include "extent_map.h"
> @@ -5490,6 +5491,39 @@ int btrfs_is_parity_mirror(struct btrfs_fs_info *fs_info, u64 logical, u64 len)
>  	return ret;
>  }
>
> +static int btrfs_find_best_stripe(struct btrfs_fs_info *fs_info,

The name btrfs_find_best_stripe should be more descriptive about the
selection criteria.

> +				  struct map_lookup *map, int first,
> +				  int num_stripe)
> +{
> +	u64 est_wait = 0;
> +	int best_stripe = 0;
> +	int index;
> +
> +	for (index = first; index < first + num_stripe; index++) {
> +		u64 read_wait;
> +		u64 avg_wait = 0;
> +		unsigned long read_ios;
> +		struct btrfs_device *device = map->stripes[index].dev;
> +
> +		read_wait = part_stat_read(device->bdev, nsecs[READ]);

This should use STAT_READ as this is supposed to be indexing the stats
members. READ is some generic constant with the same value.

> +		read_ios = part_stat_read(device->bdev, ios[READ]);
> +
> +		if (read_wait && read_ios && read_wait >= read_ios)
> +			avg_wait = div_u64(read_wait, read_ios);
> +		else
> +			btrfs_debug_rl(fs_info,
> +			"devid: %llu avg_wait ZERO read_wait %llu read_ios %lu",
> +				device->devid, read_wait, read_ios);
> +
> +		if (est_wait == 0 || est_wait > avg_wait) {
> +			est_wait = avg_wait;
> +			best_stripe = index;
> +		}
> +	}
> +
> +	return best_stripe;
> +}
> +
>  static int find_live_mirror(struct btrfs_fs_info *fs_info,
>  			    struct map_lookup *map, int first,
>  			    int dev_replace_is_ongoing)
> @@ -5519,6 +5553,10 @@ static int find_live_mirror(struct btrfs_fs_info *fs_info,
>  	case BTRFS_READ_POLICY_PID:
>  		preferred_mirror = first + (current->pid % num_stripes);
>  		break;
> +	case BTRFS_READ_POLICY_LATENCY:
> +		preferred_mirror = btrfs_find_best_stripe(fs_info, map, first,
> +							  num_stripes);
> +		break;
>  	}
>
>  	if (dev_replace_is_ongoing &&
> diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
> index 1997a4649a66..71ba1f0e93f4 100644
> --- a/fs/btrfs/volumes.h
> +++ b/fs/btrfs/volumes.h
> @@ -222,6 +222,8 @@ enum btrfs_chunk_allocation_policy {
>  enum btrfs_read_policy {
>  	/* Use process PID to choose the stripe */
>  	BTRFS_READ_POLICY_PID,
> +	/* Find and use device with the lowest latency */
> +	BTRFS_READ_POLICY_LATENCY,
>  	BTRFS_NR_READ_POLICY,
>  };
>
> --
> 2.28.0
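For reference, this is roughly what the two reads look like with the dedicated
stat constant, a sketch of the suggested change rather than code from the
posted patch:

	read_wait = part_stat_read(device->bdev, nsecs[STAT_READ]);
	read_ios  = part_stat_read(device->bdev, ios[STAT_READ]);

STAT_READ comes from enum stat_group in the block layer, which is the index the
disk_stats arrays are declared with, whereas READ only happens to share the
value 0.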
On 20/1/21 8:14 pm, David Sterba wrote:
> On Tue, Jan 19, 2021 at 11:52:05PM -0800, Anand Jain wrote:
>> The read policy type latency routes the read IO based on the historical
>> average wait-time experienced by the read IOs through the individual
>> device. This patch obtains the historical read IO stats from the kernel
>> block layer and calculates its average.
>
> This does not say how the stripe is selected using the gathered numbers.
> Ie. what is the criteria like minimum average time, "based on" is too
> vague.

Could you please add the following in the change log. Hope this will
suffice.

----------
This patch adds new read policy Latency. This policy routes the read
I/Os based on the device's average wait time for read requests. The
average is calculated by dividing the total wait time for read requests
by the total read I/Os processed by the device.

This policy uses kernel disk stat to calculate the average, so it needs
the kernel stat to be enabled. If in case the kernel stat is disabled
the policy uses the stripe 0.

This policy can be set through the read_policy sysfs interface as shown
below.

 $ echo latency > /sys/fs/btrfs/<uuid>/read_policy
 $ cat /sys/fs/btrfs/<uuid>/read_policy
 pid [latency] device roundrobin

This policy won't persist across reboot or mount unmount recycle as of
now.

Here below are few performance test results with latency compared with
pid policy.

raid1 fio read 500m
-----------------------------------------------------
dev types   | nvme+ssd   nvme+ssd    all-nvme   all-nvme
read type   | random     sequential  random     sequential
------------+------------------------------------------
pid         | 744MiB/s   809MiB/s    2225MiB/s  2155MiB/s
latency     | 2072MiB/s  2008MiB/s   1999MiB/s  1961MiB/s

raid10 fio read 500m
-----------------------------------------------------
dev types   | nvme+ssd   nvme+ssd    all-nvme   all-nvme
read type   | random     sequential  random     sequential
------------+------------------------------------------
pid         | 1282MiB/s  1427MiB/s   2152MiB/s  1969MiB/s
latency     | 2073MiB/s  1871MiB/s   1975MiB/s  1984MiB/s

raid1c3 fio read 500m
-----------------------------------------------------
dev types   | nvme+ssd   nvme+ssd    all-nvme   all-nvme
read type   | random     sequential  random     sequential
------------+------------------------------------------
pid         | 973MiB/s   955MiB/s    2144MiB/s  1962MiB/s
latency     | 2005MiB/s  1924MiB/s   2083MiB/s  1980MiB/s

raid1c4 fio read 500m
-----------------------------------------------------
dev types   | nvme+ssd   nvme+ssd    all-nvme   all-nvme
read type   | random     sequential  random     sequential
------------+------------------------------------------
pid         | 1204MiB/s  1221MiB/s   2065MiB/s  1878MiB/s
latency     | 1990MiB/s  1920MiB/s   1945MiB/s  1865MiB/s

In the given fio I/O workload above, it is found that there are fewer
I/O merges in case of latency as compared to pid. So in the case of all
homogeneous devices pid performance little better. The latency is a
better choice in the case of mixed types of devices. Also if any one of
the devices is under performing due to intermittent I/Os retries, then
the latency policy will automatically use the best available.
-----------

>> Example usage:
>> echo "latency" > /sys/fs/btrfs/$uuid/read_policy
>
> Do you have some sample results? I remember you posted something but it
> would be good to have that in the changelog too.

Thanks for suggesting now I have included it, as above. Also, I can
generate a patch reroll with this change log if needed. Please, let me
know.
Thanks, Anand
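For anyone who wants to sanity-check the per-device averages this policy would
see, the same ratio can be approximated from user space. A minimal sketch,
assuming the documented /proc/diskstats field order (reads completed and
milliseconds spent reading are the first and fourth fields after the device
name) and using "nvme0n1" only as an example device:

#include <stdio.h>
#include <string.h>

int main(void)
{
	FILE *f = fopen("/proc/diskstats", "r");
	unsigned int major, minor;
	char name[32], rest[512];
	unsigned long rd_ios, rd_merges, rd_sectors, rd_ticks_ms;

	if (!f)
		return 1;
	while (fscanf(f, "%u %u %31s %lu %lu %lu %lu", &major, &minor, name,
		      &rd_ios, &rd_merges, &rd_sectors, &rd_ticks_ms) == 7) {
		/* average read wait = time spent reading / reads completed */
		if (!strcmp(name, "nvme0n1") && rd_ios)
			printf("%s: avg read wait %lu ms\n", name,
			       rd_ticks_ms / rd_ios);
		fgets(rest, sizeof(rest), f);	/* skip the remaining fields */
	}
	fclose(f);
	return 0;
}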
On Thu, Jan 21, 2021 at 06:10:36PM +0800, Anand Jain wrote:
> On 20/1/21 8:14 pm, David Sterba wrote:
> > On Tue, Jan 19, 2021 at 11:52:05PM -0800, Anand Jain wrote:
> >> The read policy type latency routes the read IO based on the historical
> >> average wait-time experienced by the read IOs through the individual
> >> device. This patch obtains the historical read IO stats from the kernel
> >> block layer and calculates its average.
> >
> > This does not say how the stripe is selected using the gathered numbers.
> > Ie. what is the criteria like minimum average time, "based on" is too
> > vague.
>
> Could you please add the following in the change log. Hope this will
> suffice.
>
> ----------
> This patch adds new read policy Latency. This policy routes the read
> I/Os based on the device's average wait time for read requests.

'wait time' means the time from io submission to completion

> The average is calculated by dividing the total wait time for read
> requests by the total read I/Os processed by the device.

So this is based on numbers from the entire lifetime of the device? The
numbers are IMHO not a reliable source. If unrelated writes increase the
read wait time then the device will not be selected until the average
is lower than of the other devices. The average can only decrease after
there are some fast reads, which is not guaranted to happen and there's
no good estimate how long it could take to happen.

The tests we all probably do are on a fresh mkfs and with a small
workload but the mirror selection logic must work long term.

The part_stat numbers could be used but must reflect the time factor,
ie. it needs to be some a rolling average or collecting a sample for
last N seconds.

Bear in mind that this is only a heuristic and we don't need perfect
results nor we want to replace io scheduling, so the amont of collected
data or the logic should be straightforward.

> This policy uses kernel disk stat to calculate the average, so it needs
> the kernel stat to be enabled.

What is needed to enable it? I see it's always compiled in in
block/blk-core.c.

> If in case the kernel stat is disabled
> the policy uses the stripe 0.
> This policy can be set through the read_policy sysfs interface as shown
> below.
>
> $ echo latency > /sys/fs/btrfs/<uuid>/read_policy
> $ cat /sys/fs/btrfs/<uuid>/read_policy
> pid [latency] device roundrobin
>
> This policy won't persist across reboot or mount unmount recycle as of
> now.
>
> Here below are few performance test results with latency compared with
> pid policy.
>
> raid1 fio read 500m

500m is really small data size for such measurement

> -----------------------------------------------------
> dev types   | nvme+ssd   nvme+ssd    all-nvme   all-nvme
> read type   | random     sequential  random     sequential
> ------------+------------------------------------------
> pid         | 744MiB/s   809MiB/s    2225MiB/s  2155MiB/s
> latency     | 2072MiB/s  2008MiB/s   1999MiB/s  1961MiB/s

Namely when the device bandwidth is 4x higher. The data size should be
scaled up so the whole run takes at least 30 seconds if not a few
minutes.

Other missing information about the load is the number of threads and if
it's buffered or direct io.

> raid10 fio read 500m
> -----------------------------------------------------
> dev types   | nvme+ssd   nvme+ssd    all-nvme   all-nvme
> read type   | random     sequential  random     sequential
> ------------+------------------------------------------
> pid         | 1282MiB/s  1427MiB/s   2152MiB/s  1969MiB/s
> latency     | 2073MiB/s  1871MiB/s   1975MiB/s  1984MiB/s
>
> raid1c3 fio read 500m
> -----------------------------------------------------
> dev types   | nvme+ssd   nvme+ssd    all-nvme   all-nvme
> read type   | random     sequential  random     sequential
> ------------+------------------------------------------
> pid         | 973MiB/s   955MiB/s    2144MiB/s  1962MiB/s
> latency     | 2005MiB/s  1924MiB/s   2083MiB/s  1980MiB/s
>
> raid1c4 fio read 500m
> -----------------------------------------------------
> dev types   | nvme+ssd   nvme+ssd    all-nvme   all-nvme
> read type   | random     sequential  random     sequential
> ------------+------------------------------------------
> pid         | 1204MiB/s  1221MiB/s   2065MiB/s  1878MiB/s
> latency     | 1990MiB/s  1920MiB/s   1945MiB/s  1865MiB/s
>
> In the given fio I/O workload above, it is found that there are fewer
> I/O merges in case of latency as compared to pid. So in the case of all
> homogeneous devices pid performance little better.

Yeah switching the device in the middle of a contiguous range could slow
it down but as long as it's not "too much", then it's ok.

The pid selection is good for multiple threads workload but we also want
to make it work with single thread reads, like a simple 'cp'.

I tested this policy and with 2G file 'cat file' utilizes only one
device, so this is no improvement to the pid policy.

A policy based on read latency makes sense but the current
implementation does not cover enough workloads.
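A rolling average as suggested above could look something like the following.
This is only a sketch of the idea, the helper name and the 1/8 weight are made
up, and nothing like this is in the posted patch:

	/*
	 * Hypothetical per-device rolling average of read completion time.
	 * Each new sample contributes 1/8th, so old history decays instead
	 * of dominating the mirror selection forever.
	 */
	static void update_read_latency(u64 *avg_nsecs, u64 sample_nsecs)
	{
		if (*avg_nsecs == 0)
			*avg_nsecs = sample_nsecs;
		else
			*avg_nsecs = *avg_nsecs - (*avg_nsecs >> 3) +
				     (sample_nsecs >> 3);
	}

The mirror selection would then compare these per-device averages instead of
the lifetime part_stat totals.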
On 22/1/21 1:52 am, David Sterba wrote:
> On Thu, Jan 21, 2021 at 06:10:36PM +0800, Anand Jain wrote:
>> On 20/1/21 8:14 pm, David Sterba wrote:
>>> On Tue, Jan 19, 2021 at 11:52:05PM -0800, Anand Jain wrote:
>>>> The read policy type latency routes the read IO based on the historical
>>>> average wait-time experienced by the read IOs through the individual
>>>> device. This patch obtains the historical read IO stats from the kernel
>>>> block layer and calculates its average.
>>>
>>> This does not say how the stripe is selected using the gathered numbers.
>>> Ie. what is the criteria like minimum average time, "based on" is too
>>> vague.
>>
>> Could you please add the following in the change log. Hope this will
>> suffice.
>>
>> ----------
>> This patch adds new read policy Latency. This policy routes the read
>> I/Os based on the device's average wait time for read requests.
>
> 'wait time' means the time from io submission to completion

Yes, at the block layer.

>> The average is calculated by dividing the total wait time for read
>> requests by the total read I/Os processed by the device.
>
> So this is based on numbers from the entire lifetime of the device?

No, kernel stats are in memory only, so it is since boot.

> The numbers are IMHO not a reliable source. If unrelated writes increase
> the read wait time then the device will not be selected until the average
> is lower than of the other devices.

I think it is fair. Because the comparison is between the performance of
 1. disk-type-A VS disk-type-A, OR
 2. disk-type-A VS disk-type-B

In the scenario #1 above, it does not matter which disk, as both of them
provide the same performance (theoretically), which is the most common
config.

In scenario #2 the user can check the read I/O on the devices; if it is
_not_ going to the best performing device by theory, either a reboot or
an iostat-reset (which I think should be there) shall help. Or if they
can't reboot or if iostat-reset is not available, then switching to the
read-policy 'device' shall help until they reboot, which is a better
alternative than PID, which is unpredictable. Unfortunately, this
switching is not automatic (more below).

There are drawbacks to this policy. At any point in time, momentarily, a
device may get too busy due to _external factors_ such as multiple
partitions on the same device, multiple LUNs on the same HBA, or the IRQ
being shared by the disk's HBA and the gigabit network card (which has
better IRQ priority), so whenever the network is busy, the I/O on the
disk slows down (I had an opportunity to investigate such an issue
before). So the latency policy shall switch to the better performing
device at such a time. But if the theoretically better performing device
is back to its normal speed, then unless that device gets read I/O from
some operation (for example, scrub), the policy won't know. This
scenario is more crucial for the config type #2 (above).

Also, there may be a better alternative to the average wait time (for
example, another type of mean value?), which I think can be tweaked in
the long term when we understand the usage of this policy better.

If we account for the inflight commands, there will be more switching in
config type #1 (above). More switching leads to fewer I/O merges and
higher cache misses (DMA and storage level), leading to poorer
performance. So switching back and forth between devices is not good
either. So staying where we are helps, until a device performs worse
than its mirrored device.

> The average can only decrease after there are some fast reads, which is
> not guaranted to happen and there's no good estimate how long it could
> take to happen.

True. Also, there isn't any kernel part-stat reset. Not sure if the
block layer will entertain such a patch, but worth a try IMO. What do
you think? However, even if I reset, it's not guaranteed that temporary
bad stats cannot happen again. Also, it's a bit uncertain how to know
when the theoretically better performing device will be back to its
good performance.

> The tests we all probably do are on a fresh mkfs and with a small
> workload but the mirror selection logic must work long term.

I totally agree. So I am not yet recommending this policy for the
default. But it does solve some of the problems very well.

> The part_stat numbers could be used but must reflect the time factor,
> ie. it needs to be some a rolling average or collecting a sample for
> last N seconds.

But I think the problem here is to know when the theoretically better
performing device will be back to its good performance. So for that
purpose, the theoretically better performing device must be probed
periodically. And there will be a cost.

> Bear in mind that this is only a heuristic and we don't need perfect
> results nor we want to replace io scheduling, so the amont of collected
> data or the logic should be straightforward.

Yeah. If part_stat could provide stats only for the past N minutes or
so, it would be simpler. During this patch, I looked into the part_stat
code; it is not straightforward.

>> This policy uses kernel disk stat to calculate the average, so it needs
>> the kernel stat to be enabled.
>
> What is needed to enable it? I see it's always compiled in in
> block/blk-core.c.

It is enabled by default. But the user may disable part_stat collection
at run time:
 echo 0 > /sys/block/sdx/queue/iostat

>> If in case the kernel stat is disabled
>> the policy uses the stripe 0.
>> This policy can be set through the read_policy sysfs interface as shown
>> below.
>>
>> $ echo latency > /sys/fs/btrfs/<uuid>/read_policy
>> $ cat /sys/fs/btrfs/<uuid>/read_policy
>> pid [latency] device roundrobin
>>
>> This policy won't persist across reboot or mount unmount recycle as of
>> now.
>>
>> Here below are few performance test results with latency compared with
>> pid policy.
>>
>> raid1 fio read 500m
>
> 500m is really small data size for such measurement

Pls see below about this.

>> -----------------------------------------------------
>> dev types   | nvme+ssd   nvme+ssd    all-nvme   all-nvme
>> read type   | random     sequential  random     sequential
>> ------------+------------------------------------------
>> pid         | 744MiB/s   809MiB/s    2225MiB/s  2155MiB/s
>> latency     | 2072MiB/s  2008MiB/s   1999MiB/s  1961MiB/s
>
> Namely when the device bandwidth is 4x higher. The data size should be
> scaled up so the whole run takes at least 30 seconds if not a few
> minutes.
>
> Other missing information about the load is the number of threads and if
> it's buffered or direct io.

The cover letter has the fio command used. The output from the guest VM
is there. From it, I notice the I/Os performed were ~16.8G. I can run
the scripts again. Pls, do share with me if you have any ideas for
testing.

 READ: bw=87.0MiB/s (91.2MB/s), 87.0MiB/s-87.0MiB/s (91.2MB/s-91.2MB/s),
 io=15.6GiB (16.8GB), run=183884-183884msec

>> raid10 fio read 500m
>> -----------------------------------------------------
>> dev types   | nvme+ssd   nvme+ssd    all-nvme   all-nvme
>> read type   | random     sequential  random     sequential
>> ------------+------------------------------------------
>> pid         | 1282MiB/s  1427MiB/s   2152MiB/s  1969MiB/s
>> latency     | 2073MiB/s  1871MiB/s   1975MiB/s  1984MiB/s
>>
>> raid1c3 fio read 500m
>> -----------------------------------------------------
>> dev types   | nvme+ssd   nvme+ssd    all-nvme   all-nvme
>> read type   | random     sequential  random     sequential
>> ------------+------------------------------------------
>> pid         | 973MiB/s   955MiB/s    2144MiB/s  1962MiB/s
>> latency     | 2005MiB/s  1924MiB/s   2083MiB/s  1980MiB/s
>>
>> raid1c4 fio read 500m
>> -----------------------------------------------------
>> dev types   | nvme+ssd   nvme+ssd    all-nvme   all-nvme
>> read type   | random     sequential  random     sequential
>> ------------+------------------------------------------
>> pid         | 1204MiB/s  1221MiB/s   2065MiB/s  1878MiB/s
>> latency     | 1990MiB/s  1920MiB/s   1945MiB/s  1865MiB/s
>>
>> In the given fio I/O workload above, it is found that there are fewer
>> I/O merges in case of latency as compared to pid. So in the case of all
>> homogeneous devices pid performance little better.
>
> Yeah switching the device in the middle of a contiguous range could slow
> it down but as long as it's not "too much", then it's ok.

Yep.

> The pid selection is good for multiple threads workload but we also want
> to make it work with single thread reads, like a simple 'cp'.
>
> I tested this policy and with 2G file 'cat file' utilizes only one
> device, so this is no improvement to the pid policy.

In the 'cat file' test case above, all the read IOs will go to a single
stripe id. But that does not mean they will go to the same device. As of
now, our chunk allocation is based on the device's free size. So the
better thing to do is to have raid1 on disks of different sizes, for
example, 50G and 100G. Then it guarantees that stripe 0 will always be
on the 100G disk. Then it is fair to measure the pid policy. And still,
pid policy may perform better, as reading from a single disk is not a
bad idea; the read_policy type 'device' proved it. Every policy depends
on the workload, and so does the pid policy. But on top of that the pid
policy is non-deterministic, which makes it hard to say how it will
behave in a known workload.

> A policy based on read latency makes sense but the current
> implementation does not cover enough workloads.

Yeah. The performance of any policy here (including PID and round-robin)
is workload-dependent. IMHO it can't be one-size-fits-all and is meant
to be tuned.

Thanks, Anand
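One way around the missing part_stat reset is to keep a per-device snapshot
and compute the average only over the delta since the last mirror selection.
Again just a sketch with hypothetical names, not code from any posted patch:

	#include <linux/part_stat.h>
	#include <linux/math64.h>

	/* Hypothetical per-device snapshot of the read counters. */
	struct read_stat_snap {
		u64 nsecs;
		u64 ios;
	};

	/*
	 * Average read wait over the window since the last call; because
	 * only deltas are used, no counter reset is needed and old history
	 * stops influencing the selection.
	 */
	static u64 window_avg_wait(struct block_device *bdev,
				   struct read_stat_snap *last)
	{
		u64 nsecs = part_stat_read(bdev, nsecs[STAT_READ]);
		u64 ios = part_stat_read(bdev, ios[STAT_READ]);
		u64 d_nsecs = nsecs - last->nsecs;
		u64 d_ios = ios - last->ios;

		last->nsecs = nsecs;
		last->ios = ios;

		return d_ios ? div64_u64(d_nsecs, d_ios) : 0;
	}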
>> 500m is really small data size for such measurement
I reran the read policy tests with some changes in the fio command
options, mainly to measure IOPS, throughput, and latency on the
filesystem with latency-policy and pid-policy.
Each of these tests was run for 3 iterations and the best and worst of
those 3 iterations are shown below.
These workloads perform mixed read-write, which is the most commonly
used pattern, on a single device type (NVMe here), with two devices
configured as RAID1.
In all these read-write workloads, pid-policy performed ~25% better than
the latency-policy for both throughput and IOPS, and 3% better on the
latency parameter.
I haven't analyzed these read-write workloads on RAID1c3/RAID1c4 yet,
but RAID1 is more common than other types, IMO.
So I think pid-policy should remain as our default read policy.
However, as shown before, pid-policy performs worst in special configs
such as volumes with mixed device types. For those mixed-device
volumes, latency-policy performs better than pid-policy.
As tested before, latency-policy typically provided ~175% better
throughput in the case of mixed device types (SSD and NVMe).
Feedback welcome.
Fio logs below.
IOPS focused readwrite workload:
fio --filename=/btrfs/foo --size=500GB --direct=1 --rw=randrw --bs=4k
--ioengine=libaio --iodepth=256 --runtime=120 --numjobs=4 --time_based
--group_reporting --name=iops-randomreadwrite --eta-newline=1
pid [latency] device roundrobin ( 00)
read: IOPS=40.6k, BW=159MiB/s (166MB/s)(18.6GiB/120002msec)
[pid] latency device roundrobin ( 00)
read: IOPS=50.7k, BW=198MiB/s (208MB/s)(23.2GiB/120001msec)
IOPS is 25% better with pid policy.
Throughput focused readwrite workload:
fio --filename=/btrfs/foo --size=500GB --direct=1 --rw=randrw --bs=64k
--ioengine=libaio --iodepth=64 --runtime=120 --numjobs=4 --time_based
--group_reporting --name=throughput-randomreadwrite --eta-newline=1
pid [latency] device roundrobin ( 00)
read: IOPS=8525, BW=533MiB/s (559MB/s)(62.4GiB/120003msec)
[pid] latency device roundrobin ( 00)
read: IOPS=10.7k, BW=670MiB/s (702MB/s)(78.5GiB/120005msec)
Throughput is 25% better with pid policy.
Latency focused readwrite workload:
fio --filename=/btrfs/foo --size=500GB --direct=1 --rw=randrw --bs=4k
--ioengine=libaio --iodepth=1 --runtime=120 --numjobs=4 --time_based
--group_reporting --name=latency-randomreadwrite --eta-newline=1
pid [latency] device roundrobin ( 00)
read: IOPS=59.8k, BW=234MiB/s (245MB/s)(27.4GiB/120003msec)
lat (usec): min=68, max=826930, avg=1917.20, stdev=4210.90
[pid] latency device roundrobin ( 00)
read: IOPS=61.9k, BW=242MiB/s (253MB/s)(28.3GiB/120001msec)
lat (usec): min=64, max=751557, avg=1846.07, stdev=4082.97
Latency is 3% better with pid policy.
Hi Michal,

Did you get any chance to run the evaluation with this patchset?

Thanks, Anand
On Thu, Feb 04, 2021 at 08:30:01PM +0800, Anand Jain wrote:
> Hi Michal,
>
> Did you get any chance to run the evaluation with this patchset?
>
> Thanks, Anand

Hi Anand,

Yes, I tested your policies now. Sorry for late response.

For the singlethreaded test:

[global]
name=btrfs-raid1-seqread
filename=btrfs-raid1-seqread
rw=read
bs=64k
direct=0
numjobs=1
time_based=0

[file1]
size=10G
ioengine=libaio

results are:

- raid1c3 with 3 HDDs:
  3 x Segate Barracuda ST2000DM008 (2TB)
  * pid policy
    READ: bw=215MiB/s (226MB/s), 215MiB/s-215MiB/s (226MB/s-226MB/s),
    io=10.0GiB (10.7GB), run=47537-47537msec
  * latency policy
    READ: bw=219MiB/s (229MB/s), 219MiB/s-219MiB/s (229MB/s-229MB/s),
    io=10.0GiB (10.7GB), run=46852-46852msec
  * device policy - didn't test it here, I guess it doesn't make sense
    to check it on non-mixed arrays ;)
- raid1c3 with 2 HDDs and 1 SSD:
  2 x Segate Barracuda ST2000DM008 (2TB)
  1 x Crucial CT256M550SSD1 (256GB)
  * pid policy
    READ: bw=219MiB/s (230MB/s), 219MiB/s-219MiB/s (230MB/s-230MB/s),
    io=10.0GiB (10.7GB), run=46749-46749msec
  * latency policy
    READ: bw=517MiB/s (542MB/s), 517MiB/s-517MiB/s (542MB/s-542MB/s),
    io=10.0GiB (10.7GB), run=19823-19823msec
  * device policy
    READ: bw=517MiB/s (542MB/s), 517MiB/s-517MiB/s (542MB/s-542MB/s),
    io=10.0GiB (10.7GB), run=19810-19810msec

For the multithreaded test:

[global]
name=btrfs-raid1-seqread
filename=btrfs-raid1-seqread
rw=read
bs=64k
direct=0
numjobs=1
time_based=0

[file1]
size=10G
ioengine=libaio

results are:

- raid1c3 with 3 HDDs:
  3 x Segate Barracuda ST2000DM008 (2TB)
  * pid policy
    READ: bw=1608MiB/s (1686MB/s), 201MiB/s-201MiB/s (211MB/s-211MB/s),
    io=80.0GiB (85.9GB), run=50948-50949msec
  * latency policy
    READ: bw=1515MiB/s (1588MB/s), 189MiB/s-189MiB/s (199MB/s-199MB/s),
    io=80.0GiB (85.9GB), run=54081-54084msec
- raid1c3 with 2 HDDs and 1 SSD:
  2 x Segate Barracuda ST2000DM008 (2TB)
  1 x Crucial CT256M550SSD1 (256GB)
  * pid policy
    READ: bw=1843MiB/s (1932MB/s), 230MiB/s-230MiB/s (242MB/s-242MB/s),
    io=80.0GiB (85.9GB), run=44449-44450msec
  * latency policy
    READ: bw=4213MiB/s (4417MB/s), 527MiB/s-527MiB/s (552MB/s-552MB/s),
    io=80.0GiB (85.9GB), run=19444-19446msec
  * device policy
    READ: bw=4196MiB/s (4400MB/s), 525MiB/s-525MiB/s (550MB/s-550MB/s),
    io=80.0GiB (85.9GB), run=19522-19522msec

To sum it up - I think that your policies are indeed a very good match
for mixed (nonrot and rot) arrays.

They perform either slightly better or worse (depending on the test)
than pid policy on all-HDD arrays.

I've just sent out my proposal of roundrobin policy, which seems to give
better performance for all-HDD than your policies (and better than pid
policy in all cases):

https://patchwork.kernel.org/project/linux-btrfs/patch/20210209203041.21493-7-mrostecki@suse.de/

Cheers,
Michal
On 10/02/2021 05:12, Michal Rostecki wrote:
> On Thu, Feb 04, 2021 at 08:30:01PM +0800, Anand Jain wrote:
>> Hi Michal,
>>
>> Did you get any chance to run the evaluation with this patchset?
>>
>> Thanks, Anand
>
> Hi Anand,
>
> Yes, I tested your policies now. Sorry for late response.
>
> For the singlethreaded test:
>
> [global]
> name=btrfs-raid1-seqread
> filename=btrfs-raid1-seqread
> rw=read
> bs=64k
> direct=0
> numjobs=1
> time_based=0
>
> [file1]
> size=10G
> ioengine=libaio
>
> results are:
>
> - raid1c3 with 3 HDDs:
>   3 x Segate Barracuda ST2000DM008 (2TB)
>   * pid policy
>     READ: bw=215MiB/s (226MB/s), 215MiB/s-215MiB/s (226MB/s-226MB/s),
>     io=10.0GiB (10.7GB), run=47537-47537msec
>   * latency policy
>     READ: bw=219MiB/s (229MB/s), 219MiB/s-219MiB/s (229MB/s-229MB/s),
>     io=10.0GiB (10.7GB), run=46852-46852msec
>   * device policy - didn't test it here, I guess it doesn't make sense
>     to check it on non-mixed arrays ;)

Hum. device policy provided best performance in non-mixed arrays with
fio sequential workload.

raid1c3 Read 500m (time = 60sec)
-----------------------------------------------------
            | nvme+ssd   nvme+ssd   all-nvme   all-nvme
            | random     seq        random     seq
------------+-----------------------------------------
pid         | 973MiB/s   955MiB/s   2144MiB/s  1962MiB/s
latency     | 2005MiB/s  1924MiB/s  2083MiB/s  1980MiB/s
device(nvme)| 2021MiB/s  2034MiB/s  1920MiB/s  2132MiB/s
roundrobin  | 707MiB/s   701MiB/s   1760MiB/s  1990MiB/s

> - raid1c3 with 2 HDDs and 1 SSD:
>   2 x Segate Barracuda ST2000DM008 (2TB)
>   1 x Crucial CT256M550SSD1 (256GB)
>   * pid policy
>     READ: bw=219MiB/s (230MB/s), 219MiB/s-219MiB/s (230MB/s-230MB/s),
>     io=10.0GiB (10.7GB), run=46749-46749msec
>   * latency policy
>     READ: bw=517MiB/s (542MB/s), 517MiB/s-517MiB/s (542MB/s-542MB/s),
>     io=10.0GiB (10.7GB), run=19823-19823msec
>   * device policy
>     READ: bw=517MiB/s (542MB/s), 517MiB/s-517MiB/s (542MB/s-542MB/s),
>     io=10.0GiB (10.7GB), run=19810-19810msec
>
> For the multithreaded test:
>
> [global]
> name=btrfs-raid1-seqread
> filename=btrfs-raid1-seqread
> rw=read
> bs=64k
> direct=0
> numjobs=1
> time_based=0
>
> [file1]
> size=10G
> ioengine=libaio
>
> results are:
>
> - raid1c3 with 3 HDDs:
>   3 x Segate Barracuda ST2000DM008 (2TB)
>   * pid policy
>     READ: bw=1608MiB/s (1686MB/s), 201MiB/s-201MiB/s (211MB/s-211MB/s),
>     io=80.0GiB (85.9GB), run=50948-50949msec
>   * latency policy
>     READ: bw=1515MiB/s (1588MB/s), 189MiB/s-189MiB/s (199MB/s-199MB/s),
>     io=80.0GiB (85.9GB), run=54081-54084msec
> - raid1c3 with 2 HDDs and 1 SSD:
>   2 x Segate Barracuda ST2000DM008 (2TB)
>   1 x Crucial CT256M550SSD1 (256GB)
>   * pid policy
>     READ: bw=1843MiB/s (1932MB/s), 230MiB/s-230MiB/s (242MB/s-242MB/s),
>     io=80.0GiB (85.9GB), run=44449-44450msec
>   * latency policy
>     READ: bw=4213MiB/s (4417MB/s), 527MiB/s-527MiB/s (552MB/s-552MB/s),
>     io=80.0GiB (85.9GB), run=19444-19446msec
>   * device policy
>     READ: bw=4196MiB/s (4400MB/s), 525MiB/s-525MiB/s (550MB/s-550MB/s),
>     io=80.0GiB (85.9GB), run=19522-19522msec
>
> To sum it up - I think that your policies are indeed a very good match
> for mixed (nonrot and rot) arrays.
>
> They perform either slightly better or worse (depending on the test)
> than pid policy on all-HDD arrays.

Theoretically, latency would perform better, as the latency parameter
works as a feedback loop, dynamically adjusting itself to the delivered
performance. But there is overhead to calculate the latency.

Thanks, Anand

> I've just sent out my proposal of roundrobin policy, which seems to give
> better performance for all-HDD than your policies (and better than pid
> policy in all cases):
>
> https://patchwork.kernel.org/project/linux-btrfs/patch/20210209203041.21493-7-mrostecki@suse.de/
>
> Cheers,
> Michal
diff --git a/fs/btrfs/sysfs.c b/fs/btrfs/sysfs.c
index 4522a1c4cd08..7c0324fe97b2 100644
--- a/fs/btrfs/sysfs.c
+++ b/fs/btrfs/sysfs.c
@@ -915,7 +915,8 @@ static bool strmatch(const char *buffer, const char *string)
 	return false;
 }
 
-static const char * const btrfs_read_policy_name[] = { "pid" };
+/* Must follow the order as in enum btrfs_read_policy */
+static const char * const btrfs_read_policy_name[] = { "pid", "latency" };
 
 static ssize_t btrfs_read_policy_show(struct kobject *kobj,
 				      struct kobj_attribute *a, char *buf)
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 62d6a890fc50..f361f1c87eb6 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -14,6 +14,7 @@
 #include <linux/semaphore.h>
 #include <linux/uuid.h>
 #include <linux/list_sort.h>
+#include <linux/part_stat.h>
 #include "misc.h"
 #include "ctree.h"
 #include "extent_map.h"
@@ -5490,6 +5491,39 @@ int btrfs_is_parity_mirror(struct btrfs_fs_info *fs_info, u64 logical, u64 len)
 	return ret;
 }
 
+static int btrfs_find_best_stripe(struct btrfs_fs_info *fs_info,
+				  struct map_lookup *map, int first,
+				  int num_stripe)
+{
+	u64 est_wait = 0;
+	int best_stripe = 0;
+	int index;
+
+	for (index = first; index < first + num_stripe; index++) {
+		u64 read_wait;
+		u64 avg_wait = 0;
+		unsigned long read_ios;
+		struct btrfs_device *device = map->stripes[index].dev;
+
+		read_wait = part_stat_read(device->bdev, nsecs[READ]);
+		read_ios = part_stat_read(device->bdev, ios[READ]);
+
+		if (read_wait && read_ios && read_wait >= read_ios)
+			avg_wait = div_u64(read_wait, read_ios);
+		else
+			btrfs_debug_rl(fs_info,
+			"devid: %llu avg_wait ZERO read_wait %llu read_ios %lu",
+				device->devid, read_wait, read_ios);
+
+		if (est_wait == 0 || est_wait > avg_wait) {
+			est_wait = avg_wait;
+			best_stripe = index;
+		}
+	}
+
+	return best_stripe;
+}
+
 static int find_live_mirror(struct btrfs_fs_info *fs_info,
 			    struct map_lookup *map, int first,
 			    int dev_replace_is_ongoing)
@@ -5519,6 +5553,10 @@ static int find_live_mirror(struct btrfs_fs_info *fs_info,
 	case BTRFS_READ_POLICY_PID:
 		preferred_mirror = first + (current->pid % num_stripes);
 		break;
+	case BTRFS_READ_POLICY_LATENCY:
+		preferred_mirror = btrfs_find_best_stripe(fs_info, map, first,
+							  num_stripes);
+		break;
 	}
 
 	if (dev_replace_is_ongoing &&
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index 1997a4649a66..71ba1f0e93f4 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -222,6 +222,8 @@ enum btrfs_chunk_allocation_policy {
 enum btrfs_read_policy {
 	/* Use process PID to choose the stripe */
 	BTRFS_READ_POLICY_PID,
+	/* Find and use device with the lowest latency */
+	BTRFS_READ_POLICY_LATENCY,
 	BTRFS_NR_READ_POLICY,
 };