[v4,0/3] btrfs: read_policy types latency, device and round-robin

Message ID cover.1611114341.git.anand.jain@oracle.com (mailing list archive)

Message

Anand Jain Jan. 20, 2021, 7:52 a.m. UTC
v4:
Add rb from Josef in patch 1 and 3.
In patch 1/3, use fs_info instead of device->fs_devices->fs_info.
Drop the round-robin policy because my workload (fio random read) shows no
 performance gains; it causes fewer merges at the block layer.

v3:
The block layer commit 0d02129e76ed (block: merge struct block_device and
struct hd_struct) changed the first argument of the function
part_stat_read_all() in 5.11-rc1, so propagate that change into patch 1/4.

v2:
Fixes as per review comments, as in the individual patches.

rfc->v1:
Drop the tracing patch.
Drop the factor associated with the inflight commands (because there
were too many unnecessary switches).
A few C style fixes.

-----

This patchset adds the read policy types latency, device, and round-robin for
mirrored raid profiles such as raid1, raid1c3, raid1c4, and raid10. The default
read policy remains PID for now.

Read policy types:
Latency:

The latency policy routes read IO based on the historical average
wait time experienced by read IOs on each individual device.

Device:

The device policy, together with the read_preferred flag, lets you
set the read device manually. It is useful for testing mirrors in a
deterministic way and for advanced system administration.

Round-robin (RFC patch):

Alternates the striped devices in a round-robin loop for reading. To achieve
this, we first put the stripes into an array, sort it by devid, and pick the
next device.

Test scripts:
=============

I have included a few scripts which were useful for testing.

-------------------8<--------------------------------
Set latency policy on the btrfs mounted at /mnt

Usage example:
  $ readpolicyset /mnt latency

Anand Jain (3):
  btrfs: add read_policy latency
  btrfs: introduce new device-state read_preferred
  btrfs: introduce new read_policy device

 fs/btrfs/sysfs.c   | 57 ++++++++++++++++++++++++++++++++++++++++++-
 fs/btrfs/volumes.c | 60 ++++++++++++++++++++++++++++++++++++++++++++++
 fs/btrfs/volumes.h |  5 ++++
 3 files changed, 121 insertions(+), 1 deletion(-)

Comments

Anand Jain Jan. 22, 2021, 5:52 a.m. UTC | #1
[Oops. A part of the cover letter is missing again. The cover-letter
  file has it all. I am not sure why that happened.
  I am resending it below by email.]

v4:
Add rb from Josef in patch 1 and 3.
In patch 1/3, use fs_info instead of device->fs_devices->fs_info.
Drop the round-robin policy because my workload (fio random read) shows no
performance gains; it causes fewer merges at the block layer.

v3:
The block layer commit 0d02129e76ed (block: merge struct block_device
and struct hd_struct) changed the first argument of the function
part_stat_read_all() in 5.11-rc1, so propagate that change into
patch 1/4.

v2:
Fixes as per review comments, as in the individual patches.

rfc->v1:
Drop the tracing patch.
Drop the factor associated with the inflight commands (because there
were too many unnecessary switches).
A few C style fixes.

-----

This patchset adds the read policy types latency, device, and round-robin
for mirrored raid profiles such as raid1, raid1c3, raid1c4, and raid10.
The default read policy remains PID for now.

Read policy types:
Latency:

The latency policy routes read IO based on the historical average
wait time experienced by read IOs on each individual device.
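
(For intuition, a minimal user-space sketch of the statistic involved. The
kernel computes the average from its own part_stat counters, so this is only
an approximation, and the script name "avgwait" is made up; the field
positions in /sys/block/<dev>/stat are the standard ones.)

$ cat avgwait
#!/bin/bash
# Rough per-device average read wait, in ms.
# /sys/block/<dev>/stat: field 1 = read I/Os completed, field 4 = read ticks (ms).
for dev in "$@"
do
	read r_ios r_merges r_sectors r_ticks rest < /sys/block/$dev/stat
	[ "$r_ios" -gt 0 ] && echo "$dev: $((r_ticks / r_ios)) ms avg read wait"
done

Usage example:
   $ avgwait sda sdb vdb vdc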

Device:

The device policy, together with the read_preferred flag, lets you
set the read device manually. It is useful for testing mirrors in a
deterministic way and for advanced system administration.
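
(For reference, a minimal sketch of the sysfs knobs this maps to; the
readpolicyset script below wraps the same steps. The mount point and devid
here are just examples.)

# prefer devid 1 as the read mirror, then switch the policy to "device"
$ uuid=$(btrfs filesystem show -m /mnt | grep uuid | awk '{print $4}')
$ echo 1 > /sys/fs/btrfs/$uuid/devinfo/1/read_preferred
$ echo device > /sys/fs/btrfs/$uuid/read_policy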

Round-robin (RFC patch, removed in v4):

Alternates the striped devices in a round-robin loop for reading. To achieve
this, we first put the stripes into an array, sort it by devid, and pick the
next device.

Test scripts:
=============

I have included a few scripts which were useful for testing.

-------------------8<--------------------------------
Set latency policy on the btrfs mounted at /mnt

Usage example:
   $ readpolicyset /mnt latency

$ cat readpolicyset
#!/bin/bash

: ${1?"arg1 <mnt> missing"}
: ${2?"arg2 <pid|latency|device|roundrobin> missing"}

mnt=$1
policy=$2
[ $policy == "device" ] && { : ${3?"arg3 <devid> missing"}; }
devid=$3

uuid=$(btrfs fi show -m $mnt | grep uuid | awk '{print $4}')
p=/sys/fs/btrfs/$uuid/read_policy
q=/sys/fs/btrfs/$uuid/devinfo

[ $policy == "device" ] && { echo 1 > ${q}/$devid/read_preferred || exit $?; }

echo $policy > $p
exit $?

-------------------8<--------------------------------

Read policy type from the btrfs mounted at /mnt

Usage example:
   $ readpolicy /mnt

$ cat readpolicy
#!/bin/bash

: ${1?"arg1 <mnt> missing"}
mnt=$1

uuid=$(btrfs fi show -m $mnt | grep uuid | awk '{print $4}')
p=/sys/fs/btrfs/$uuid/read_policy
q=/sys/fs/btrfs/$uuid/devinfo

policy=$(cat $p)
echo -n "$policy ( "

for i in $(find $q -type f -name read_preferred | xargs cat)
do
	echo -n "$i"
done
echo ")"
-------------------8<--------------------------------

Show the number of read IOs per device for the given command.

Usage example:
    $ readstat /mnt fioread

$ cat readstat
#!/bin/bash

: ${1?"arg1 <mnt> is missing"}
: ${2?"arg2 <cmd-to-run> is missing"}

mnt=$1; shift
mountpoint -q $mnt || { echo "ERROR: $mnt is not mounted"; exit 1; }

declare -A devread

for dev in $(btrfs filesystem show -m $mnt | grep devid |awk '{print $8}')
do
	prefix=$(echo $dev | rev | cut -d"/" -f1 | rev)
	sysfs_path=$(find /sys | grep $prefix/stat$)

	devread[$sysfs_path]=$(cat $sysfs_path | awk '{print $1}')
done

"$@" | grep "READ: bw"

echo
echo
for sysfs_path in ${!devread[@]}
do
	dev=$(echo $sysfs_path | rev | cut -d"/" -f2 | rev)
	new=$(cat $sysfs_path | awk '{print $1}')
	old=${devread[$sysfs_path]}
	echo "$dev $((new - old))"
done
-------------------8<--------------------------------

Run fio read command

Usage example:
     $ touch /mnt/largefile
     $ fioread /mnt/largefile 500m

$ cat fioread
#!/bin/bash

: ${1?"arg1 </mnt/file> is missing"}
: ${2?"arg2 <1Gi|50Gi> is missing"}

tf=$1
sz=$2
mnt=$(stat -c '%m' $tf)

fio \
--filename=$tf \
--directory=$mnt \
--filesize=$sz \
--size=$sz \
--rw=randread \
--bs=64k \
--ioengine=libaio \
--direct=1 \
--numjobs=32 \
--group_reporting \
--thread \
--name iops-test-job
-------------------8<--------------------------------


Testing on guest VM
~~~~~~~~~~~~~~~~~~~

The test results from my VM, with 2 devices of type sata and 2 devices of
type virtio, are below. Performance results for raid1c4, raid10, and
raid1 follow.

The workload is fio read 32 threads, 500m random reads.

The fio command is passed to the readstat script, which reports the number
of read IOs sent to each device during the fio run (excluding merges).
Supporting fio logs are included below.
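
(The dropcache helper used in the commands below is not included with the
scripts above; presumably it is something like this minimal sketch.)

$ cat dropcache
#!/bin/bash
sync
echo 3 > /proc/sys/vm/drop_caches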

raid1c4
=======

pid
----

$ readpolicyset /btrfs pid && readpolicy /btrfs && dropcache && readstat /btrfs fioread /btrfs/largefile 500m

[pid] latency device roundrobin ( 0000)
  READ: bw=87.0MiB/s (91.2MB/s), 87.0MiB/s-87.0MiB/s (91.2MB/s-91.2MB/s), io=15.6GiB (16.8GB), run=183884-183884msec

vdb 64060
vdc 64053
sdb 64072
sda 64054


latency
-------

(All devices are non-rotational, but sda and sdb are of type sata and vdb and vdc are of type virtio).

$ readpolicyset /btrfs latency && readpolicy /btrfs && dropcache && readstat /btrfs fioread /btrfs/largefile 500m

pid [latency] device roundrobin ( 0000)
  READ: bw=87.1MiB/s (91.3MB/s), 87.1MiB/s-87.1MiB/s (91.3MB/s-91.3MB/s), io=15.6GiB (16.8GB), run=183774-183774msec

vdb 255844
vdc 559
sdb 0
sda 93


roundrobin
----------

$ readpolicyset /btrfs roundrobin && readpolicy /btrfs && dropcache && readstat /btrfs fioread /btrfs/largefile 500m

pid latency device [roundrobin] ( 0000)
  READ: bw=51.0MiB/s (54.5MB/s), 51.0MiB/s-51.0MiB/s (54.5MB/s-54.5MB/s), io=15.6GiB (16.8GB), run=307755-307755msec

vdb 866859
vdc 866651
sdb 864139
sda 865533


raid10
======

pid
---

$ readpolicyset /btrfs pid && readpolicy /btrfs && dropcache && readstat /btrfs fioread /btrfs/largefile 500m

[pid] latency device roundrobin ( 0000)
  READ: bw=85.2MiB/s (89.3MB/s), 85.2MiB/s-85.2MiB/s (89.3MB/s-89.3MB/s), io=15.6GiB (16.8GB), run=187864-187864msec


sdf 64053
sde 64036
sdd 64043
sdc 64038


latency
-------

$ readpolicyset /btrfs latency && readpolicy /btrfs && dropcache && readstat /btrfs fioread /btrfs/largefile 500m

pid [latency] device roundrobin ( 0000)
  READ: bw=85.4MiB/s (89.5MB/s), 85.4MiB/s-85.4MiB/s (89.5MB/s-89.5MB/s), io=15.6GiB (16.8GB), run=187370-187370msec


sdf 117494
sde 10748
sdd 125247
sdc 2921

roundrobin
----------

$ readpolicyset /btrfs roundrobin && readpolicy /btrfs && dropcache && readstat /btrfs fioread /btrfs/largefile 500m

pid latency device [roundrobin] ( 0000)
  READ: bw=55.4MiB/s (58.1MB/s), 55.4MiB/s-55.4MiB/s (58.1MB/s-58.1MB/s), io=15.6GiB (16.8GB), run=288701-288701msec

sdf 617593
sde 617381
sdd 618486
sdc 618633


raid1
=====

pid
----

$ readpolicyset /btrfs pid && readpolicy /btrfs && dropcache && readstat /btrfs fioread /btrfs/largefile 500m

[pid] latency device roundrobin ( 00)
  READ: bw=78.8MiB/s (82.6MB/s), 78.8MiB/s-78.8MiB/s (82.6MB/s-82.6MB/s), io=15.6GiB (16.8GB), run=203158-203158msec

sdb 128087
sda 128090


latency
--------

$ readpolicyset /btrfs latency && readpolicy /btrfs && dropcache && readstat /btrfs fioread /btrfs/largefile 500m

pid [latency] device roundrobin ( 00)
  READ: bw=86.5MiB/s (90.7MB/s), 86.5MiB/s-86.5MiB/s (90.7MB/s-90.7MB/s), io=15.6GiB (16.8GB), run=185023-185023msec

sdb 567
sda 255942


device
-------

(From the latency test results above we know sda provides the lowest-latency
read IO, so set sda as the read-preferred device.)

$ readpolicyset /btrfs device 1 && readpolicy /btrfs && dropcache && readstat /btrfs fioread /btrfs/largefile 500m

pid latency [device] roundrobin ( 10)
  READ: bw=88.2MiB/s (92.5MB/s), 88.2MiB/s-88.2MiB/s (92.5MB/s-92.5MB/s), io=15.6GiB (16.8GB), run=181374-181374msec

sdb 0
sda 256191


roundrobin
-----------

$ readpolicyset /btrfs roundrobin && readpolicy /btrfs && dropcache && readstat /btrfs fioread /btrfs/largefile 500m

pid latency device [roundrobin] ( 00)
  READ: bw=54.1MiB/s (56.7MB/s), 54.1MiB/s-54.1MiB/s (56.7MB/s-56.7MB/s), io=15.6GiB (16.8GB), run=295693-295693msec

sdb 1252584
sda 1254258


Testing on real hardware:
~~~~~~~~~~~~~~~~~~~~~~~~

raid1 Read 500m
-----------------------------------------------------
             |nvme+ssd  nvme+ssd   all-nvme  all-nvme
             |random    sequential random    sequential
------------+------------------------------------------
pid         | 744MiB/s  809MiB/s  2225MiB/s 2155MiB/s
latency     |2072MiB/s 2008MiB/s  1999MiB/s 1961MiB/s
device(nvme)|2187MiB/s 2063MiB/s  2125MiB/s 2080MiB/s
roundrobin  | 527MiB/s  519MiB/s  2137MiB/s 1876MiB/s


raid10 Read 500m
-----------------------------------------------------
             | nvme+ssd  nvme+ssd  all-nvme  all-nvme
             | random    seq       random    seq
------------+-----------------------------------------
pid         | 1282MiB/s 1427MiB/s 2152MiB/s 1969MiB/s
latency     | 2073MiB/s 1871MiB/s 1975MiB/s 1984MiB/s
device(nvme)| 2447MiB/s 1873MiB/s 2184MiB/s 2015MiB/s
roundrobin  | 1117MiB/s 1076MiB/s 2020MiB/s 2030MiB/s


raid1c3 Read 500m
-----------------------------------------------------
             | nvme+ssd  nvme+ssd  all-nvme  all-nvme
             | random    seq       random    seq
------------+-----------------------------------------
pid         |  973MiB/s  955MiB/s 2144MiB/s 1962MiB/s
latency     | 2005MiB/s 1924MiB/s 2083MiB/s 1980MiB/s
device(nvme)| 2021MiB/s 2034MiB/s 1920MiB/s 2132MiB/s
roundrobin  |  707MiB/s  701MiB/s 1760MiB/s 1990MiB/s


raid1c4 Read 500m
-----------------------------------------------------
             | nvme+ssd  nvme+ssd  all-nvme  all-nvme
             | random    seq       random    seq
------------+----------------------------------------
pid         | 1204MiB/s 1221MiB/s 2065MiB/s 1878MiB/s
latency     | 1990MiB/s 1920MiB/s 1945MiB/s 1865MiB/s
device(nvme)| 2109MiB/s 1935MiB/s 2153MiB/s 1991MiB/s
roundrobin  |  887MiB/s  865MiB/s 1948MiB/s 1796MiB/s


Observations:
=============

1.
Our chunk allocation is based on each device's available size at the time,
so stripe 0 may rotate among the devices. Hence a single-threaded process
running with a constant PID may still balance the read IO among devices.
But this is not guaranteed to work in all cases, and it might not work very
well for raid1c3/4. Further, PID gives terrible performance if the devices
are heterogeneous in type, speed, or size.
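
(A quick shell illustration of that modulo selection, mirroring the
preferred_mirror computation visible in the debug diff below: with a
constant PID, every iteration resolves to the same mirror index.)

$ num_stripes=2; first=0
$ for i in 1 2 3; do echo "preferred mirror: $((first + $$ % num_stripes))"; done
# prints the same index three times, since $$ does not change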

2.
Latency provides performance equal to PID if all devices are of the same
type. Latency needs iostat to be enabled and includes the cost of
calculating the average wait time. If you factor a similar cost of
calculating the average wait time into the PID policy (using the debug
code [2]), then latency performs better than PID. This shows that
distributing read IO by latency works, but it has a cost. Moreover,
latency works for any type of device.

3.
Round-robin is the worst (unless there is a bug in my patch). The total
number of new IOs issued is almost double compared with the PID and
latency read policies, because constantly switching devices in btrfs
leads to fewer IO merges in the block layer.

4.
The device read_policy is useful for testing and provides advanced sysadmin
capabilities. Used knowingly, this policy could help avert performance
degradation due to csum/IO errors in production.
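
(A hedged sketch of that workflow, using the scripts above; the devid is a
placeholder: check the per-device error counters, then pin reads to a
healthy mirror.)

$ btrfs device stats /mnt | awk '$2 != 0'    # show any non-zero error counters
$ readpolicyset /mnt device <healthy-devid>  # then prefer the healthy mirror for reads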

Thanks, Anand

------------------
[2] Debug patch to factor the cost of calculating the latency per IO.

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index d3023879bdf6..72ec633e9063 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -5665,6 +5665,12 @@ static int find_live_mirror(struct btrfs_fs_info *fs_info,
                 fs_info->fs_devices->read_policy = BTRFS_READ_POLICY_PID;
                 fallthrough;
         case BTRFS_READ_POLICY_PID:
+                /*
+                 * Just to factor in the cost of calculating the avg wait using
+                 * iostat, call btrfs_find_best_stripe() here for the PID policy
+                 * and drop its results on the floor.
+                 */
+                btrfs_find_best_stripe(fs_info, map, first, num_stripes, log,
+                                       logsz);
                 preferred_mirror = first + current->pid % num_stripes;
                 scnprintf(log, logsz,
                           "first %d num_stripe %d %s (%d) preferred %d",
-------------------------


