Message ID | 8541d6d7a63e470b9f4c22ba95cd64fc@waffle.tech (mailing list archive)
State | New, archived
Series | btrfs: balance RAID1/RAID10 mirror selection
On 16.10.20 г. 8:59 ч., louis@waffle.tech wrote:
> Balance RAID1/RAID10 mirror selection via plain round-robin scheduling. This should roughly double throughput for large reads.
>
> Signed-off-by: Louis Jencka <louis@waffle.tech>

Can you show numbers substantiating your claims?
On 2020/10/16 3:15 PM, Nikolay Borisov wrote:
>
> On 16.10.20 г. 8:59 ч., louis@waffle.tech wrote:
>> Balance RAID1/RAID10 mirror selection via plain round-robin scheduling. This should roughly double throughput for large reads.
>>
>> Signed-off-by: Louis Jencka <louis@waffle.tech>
>
> Can you show numbers substantiating your claims?

And isn't this related to the read policy? IIRC some patches are floating around on the mailing list to provide the framework for how the read policy should work.

Those patches aren't yet merged?

Thanks,
Qu
October 16, 2020 1:15 AM, "Nikolay Borisov" <nborisov@suse.com> wrote:

> On 16.10.20 г. 8:59 ч., louis@waffle.tech wrote:
>
>> Balance RAID1/RAID10 mirror selection via plain round-robin scheduling. This should roughly double
>> throughput for large reads.
>>
>> Signed-off-by: Louis Jencka <louis@waffle.tech>
>
> Can you show numbers substantiating your claims?

Sure thing. Below are the results from some tests I've run using a Debian 10 VM. It has two 10GiB disks attached, each independently limited to 100MB/s total IO (using libvirt's iotune feature). They've been used to create the RAID1 volume mounted at /mnt. I've truncated the fio output to just show the high-level stats.

Reading a large file performs twice as well with the patch applied (203 MB/s vs 101 MB/s). Writing a large file, and the random read-write tests, look like they perform roughly the same to me.

Louis

---

Without patch
=============

louis@debian:/mnt$ uname -a
Linux debian 4.19.0-11-amd64 #1 SMP Debian 4.19.146-1 (2020-09-17) x86_64 GNU/Linux

louis@debian:/mnt$ btrfs fi df /mnt
Data, RAID1: total=7.00GiB, used=2.00MiB
System, RAID1: total=8.00MiB, used=16.00KiB
Metadata, RAID1: total=1.00GiB, used=688.00KiB
GlobalReserve, single: total=29.19MiB, used=352.00KiB

louis@debian:/mnt$ dd if=/dev/urandom of=/mnt/test bs=1M count=1024 conv=fdatasync
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 12.928 s, 83.1 MB/s

louis@debian:/mnt$ dd if=/mnt/test of=/dev/null
2097152+0 records in
2097152+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 10.6403 s, 101 MB/s

louis@debian:/mnt$ fio --name=random-rw --ioengine=posixaio --rw=randrw --bs=4k --numjobs=2 --size=2g --iodepth=1 --runtime=60 --time_based --end_fsync=1
random-rw: (g=0): rw=randrw, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=posixaio, iodepth=1
...
Run status group 0 (all jobs):
   READ: bw=15.8MiB/s (16.5MB/s), 7459KiB/s-8671KiB/s (7638kB/s-8879kB/s), io=964MiB (1010MB), run=61179-61179msec
  WRITE: bw=15.8MiB/s (16.6MB/s), 7490KiB/s-8682KiB/s (7670kB/s-8890kB/s), io=966MiB (1013MB), run=61179-61179msec

With patch
==========

louis@debian:/mnt$ uname -a
Linux debian 4.19.0-11-amd64 #1 SMP Debian 4.19.146-1a~test (2020-10-15) x86_64 GNU/Linux

louis@debian:/mnt$ dd if=/dev/urandom of=/mnt/test bs=1M count=1024 conv=fdatasync
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 11.8067 s, 90.9 MB/s

louis@debian:/mnt$ dd if=/mnt/test of=/dev/null
2097152+0 records in
2097152+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 5.28642 s, 203 MB/s

louis@debian:/mnt$ fio --name=random-rw --ioengine=posixaio --rw=randrw --bs=4k --numjobs=2 --size=2g --iodepth=1 --runtime=60 --time_based --end_fsync=1
random-rw: (g=0): rw=randrw, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=posixaio, iodepth=1
...
Run status group 0 (all jobs):
   READ: bw=16.5MiB/s (17.3MB/s), 8217KiB/s-8652KiB/s (8414kB/s-8860kB/s), io=1025MiB (1074MB), run=62202-62202msec
  WRITE: bw=16.5MiB/s (17.3MB/s), 8218KiB/s-8698KiB/s (8415kB/s-8907kB/s), io=1028MiB (1077MB), run=62202-62202msec
louis@waffle.tech wrote:
> Balance RAID1/RAID10 mirror selection via plain round-robin scheduling. This should roughly double throughput for large reads.
>
> Signed-off-by: Louis Jencka <louis@waffle.tech>
> ---
>  fs/btrfs/volumes.c | 6 +++++-
>  1 file changed, 5 insertions(+), 1 deletion(-)
>
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index 58b9c419a..45c581d46 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -333,6 +333,9 @@ struct list_head * __attribute_const__ btrfs_get_fs_uuids(void)
>  	return &fs_uuids;
>  }
>
> +/* Used for round-robin balancing RAID1/RAID10 reads. */
> +atomic_t rr_counter = ATOMIC_INIT(0);
> +
>  /*
>   * alloc_fs_devices - allocate struct btrfs_fs_devices
>   * @fsid: if not NULL, copy the UUID to fs_devices::fsid
> @@ -5482,7 +5485,8 @@ static int find_live_mirror(struct btrfs_fs_info *fs_info,
>  	else
>  		num_stripes = map->num_stripes;
>
> -	preferred_mirror = first + current->pid % num_stripes;
> +	preferred_mirror = first +
> +		(unsigned)atomic_inc_return(&rr_counter) % num_stripes;
>
>  	if (dev_replace_is_ongoing &&
>  	    fs_info->dev_replace.cont_reading_from_srcdev_mode ==

I am just a regular user of BTRFS, but the btrfs_fs_info structure contains a substructure (dirty_metadata_bytes). Assuming that writing data leads to dirty_metadata_bytes being set to non-zero, wouldn't something along the lines of this potentially be better?

    if (fs_info->dirty_metadata_bytes)
        atomic_inc_return(&rr_counter);
    preferred_mirror = first + (rr_counter + current->pid) % num_stripes;

My knowledge of the BTRFS code is near zero and I should know better than to post "code" here, but I can't resist getting my point across. There must be a simple way to improve slightly over a regular round-robin scheme.
October 16, 2020 1:29 AM, "Qu Wenruo" <quwenruo.btrfs@gmx.com> wrote:

> On 2020/10/16 3:15 PM, Nikolay Borisov wrote:
>
>> On 16.10.20 г. 8:59 ч., louis@waffle.tech wrote:
>>> Balance RAID1/RAID10 mirror selection via plain round-robin scheduling. This should roughly double
>>> throughput for large reads.
>>>
>>> Signed-off-by: Louis Jencka <louis@waffle.tech>
>>
>> Can you show numbers substantiating your claims?
>
> And isn't this related to the read policy? IIRC some patches are
> floating in the mail list to provide the framework of how the read
> policy should work.
>
> Those patches aren't yet merged?
>
> Thanks,
> Qu

Oh, I hadn't seen the read-policy framework thread. Within that, this could be a new policy or a replacement for BTRFS_READ_POLICY_PID. Is work ongoing for those patches?

Louis
On 17/10/20 1:28 am, louis@waffle.tech wrote:
> October 16, 2020 1:29 AM, "Qu Wenruo" <quwenruo.btrfs@gmx.com> wrote:
>
>> On 2020/10/16 3:15 PM, Nikolay Borisov wrote:
>>
>>> On 16.10.20 г. 8:59 ч., louis@waffle.tech wrote:
>>>> Balance RAID1/RAID10 mirror selection via plain round-robin scheduling. This should roughly double
>>>> throughput for large reads.
>>>>
>>>> Signed-off-by: Louis Jencka <louis@waffle.tech>
>>>
>>> Can you show numbers substantiating your claims?
>>
>> And isn't this related to the read policy? IIRC some patches are
>> floating in the mail list to provide the framework of how the read
>> policy should work.
>>
>> Those patches aren't yet merged?
>>
>> Thanks,
>> Qu
>
> Oh, I hadn't seen the read-policy framework thread. Within that, this could be
> a new policy or a replacement for BTRFS_READ_POLICY_PID. Is work ongoing for
> those patches?
>

Yes. Those patches are just a few days away; please stay tuned.

Thanks,
Anand

> Louis
>
Hi, Louis Jencka

Can we move 'atomic_t rr_counter' into 'struct btrfs_fs_info' to support multiple mounted btrfs filesystems?

Although the 'readmirror feature (read_policy sysfs)' is being discussed, round-robin is a replacement for BTRFS_READ_POLICY_PID in most cases. Do we no longer need BTRFS_READ_POLICY_PID?

Best Regards
Wang Yugui (wangyugui@e16-tech.com)
2020/10/23

> Balance RAID1/RAID10 mirror selection via plain round-robin scheduling. This should roughly double throughput for large reads.
>
> Signed-off-by: Louis Jencka <louis@waffle.tech>
> ---
>  fs/btrfs/volumes.c | 6 +++++-
>  1 file changed, 5 insertions(+), 1 deletion(-)
>
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index 58b9c419a..45c581d46 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -333,6 +333,9 @@ struct list_head * __attribute_const__ btrfs_get_fs_uuids(void)
>  	return &fs_uuids;
>  }
>
> +/* Used for round-robin balancing RAID1/RAID10 reads. */
> +atomic_t rr_counter = ATOMIC_INIT(0);
> +
>  /*
>   * alloc_fs_devices - allocate struct btrfs_fs_devices
>   * @fsid: if not NULL, copy the UUID to fs_devices::fsid
> @@ -5482,7 +5485,8 @@ static int find_live_mirror(struct btrfs_fs_info *fs_info,
>  	else
>  		num_stripes = map->num_stripes;
>
> -	preferred_mirror = first + current->pid % num_stripes;
> +	preferred_mirror = first +
> +		(unsigned)atomic_inc_return(&rr_counter) % num_stripes;
>
>  	if (dev_replace_is_ongoing &&
>  	    fs_info->dev_replace.cont_reading_from_srcdev_mode ==
On 23.10.20 г. 10:38 ч., Wang Yugui wrote:
> Hi, Louis Jencka
>
> Can we move 'atomic_t rr_counter' into 'struct btrfs_fs_info' to
> support multiple mounted btrfs filesystem?
>

And introduce constant cache line pings for every read. This thing needs to be tested under load with perf to see what kind of overhead the shared atomic_t counter adds. My hunch is this should really be a per-cpu variable.

<snip>
Hi, Louis Jencka
Cc: Anand Jain

Maybe we still need BTRFS_READ_POLICY_PID because of readahead.

There is readahead inside the OS and readahead inside some disks. For most SSD/SAS and SSD/SATA disks there seems to be readahead inside the disk, but for some SSD/NVMe disks there seems to be NO readahead inside the disk.

BTRFS_READ_POLICY_PID cooperates with readahead better in some cases.

Best Regards
Wang Yugui (wangyugui@e16-tech.com)
2020/10/27

> Hi, Louis Jencka
>
> Can we move 'atomic_t rr_counter' into 'struct btrfs_fs_info' to
> support multiple mounted btrfs filesystem?
>
> Although 'readmirror feature (read_policy sysfs)' is talked about,
> round-robin is a replacement for BTRFS_READ_POLICY_PID in most case,
> we no longer need BTRFS_READ_POLICY_PID ?
>
> Best Regards
> Wang Yugui (wangyugui@e16-tech.com)
> 2020/10/23
>
>> Balance RAID1/RAID10 mirror selection via plain round-robin scheduling. This should roughly double throughput for large reads.
>>
>> Signed-off-by: Louis Jencka <louis@waffle.tech>
>> ---
>>  fs/btrfs/volumes.c | 6 +++++-
>>  1 file changed, 5 insertions(+), 1 deletion(-)
>>
>> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
>> index 58b9c419a..45c581d46 100644
>> --- a/fs/btrfs/volumes.c
>> +++ b/fs/btrfs/volumes.c
>> @@ -333,6 +333,9 @@ struct list_head * __attribute_const__ btrfs_get_fs_uuids(void)
>>  	return &fs_uuids;
>>  }
>>
>> +/* Used for round-robin balancing RAID1/RAID10 reads. */
>> +atomic_t rr_counter = ATOMIC_INIT(0);
>> +
>>  /*
>>   * alloc_fs_devices - allocate struct btrfs_fs_devices
>>   * @fsid: if not NULL, copy the UUID to fs_devices::fsid
>> @@ -5482,7 +5485,8 @@ static int find_live_mirror(struct btrfs_fs_info *fs_info,
>>  	else
>>  		num_stripes = map->num_stripes;
>>
>> -	preferred_mirror = first + current->pid % num_stripes;
>> +	preferred_mirror = first +
>> +		(unsigned)atomic_inc_return(&rr_counter) % num_stripes;
>>
>>  	if (dev_replace_is_ongoing &&
>>  	    fs_info->dev_replace.cont_reading_from_srcdev_mode ==
Good points. I'll move it into 'struct btrfs_fs_info', look at making it a per-cpu variable, and do some testing with perf (later this week, hopefully).

Louis

October 23, 2020 1:42 AM, "Nikolay Borisov" <nborisov@suse.com> wrote:

> On 23.10.20 г. 10:38 ч., Wang Yugui wrote:
>
>> Hi, Louis Jencka
>>
>> Can we move 'atomic_t rr_counter' into 'struct btrfs_fs_info' to
>> support multiple mounted btrfs filesystem?
>
> And introduce constant cache line pings for every read. This thing needs
> to be tested under load with perf to see what kind of overhead the
> shared atomic_t counter adds. My hunch is this should really be a
> per-cpu variable.
>
> <snip>
Ha, never mind: looks like Anand Jain beat me to the punch.

Louis

November 2, 2020 12:29 PM, louis@waffle.tech wrote:

> Good points. I'll move it into 'struct btrfs_fs_info', look at making it a per-cpu variable, and do
> some testing with perf (later this week hopefully).
>
> Louis
>
> October 23, 2020 1:42 AM, "Nikolay Borisov" <nborisov@suse.com> wrote:
>
>> On 23.10.20 г. 10:38 ч., Wang Yugui wrote:
>>
>>> Hi, Louis Jencka
>>>
>>> Can we move 'atomic_t rr_counter' into 'struct btrfs_fs_info' to
>>> support multiple mounted btrfs filesystem?
>>
>> And introduce constant cache line pings for every read. This thing needs
>> to be tested under load with perf to see what kind of overhead the
>> shared atomic_t counter adds. My hunch is this should really be a
>> per-cpu variable.
>>
>> <snip>
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 58b9c419a..45c581d46 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -333,6 +333,9 @@ struct list_head * __attribute_const__ btrfs_get_fs_uuids(void)
 	return &fs_uuids;
 }
 
+/* Used for round-robin balancing RAID1/RAID10 reads. */
+atomic_t rr_counter = ATOMIC_INIT(0);
+
 /*
  * alloc_fs_devices - allocate struct btrfs_fs_devices
  * @fsid: if not NULL, copy the UUID to fs_devices::fsid
@@ -5482,7 +5485,8 @@ static int find_live_mirror(struct btrfs_fs_info *fs_info,
 	else
 		num_stripes = map->num_stripes;
 
-	preferred_mirror = first + current->pid % num_stripes;
+	preferred_mirror = first +
+		(unsigned)atomic_inc_return(&rr_counter) % num_stripes;
 
 	if (dev_replace_is_ongoing &&
 	    fs_info->dev_replace.cont_reading_from_srcdev_mode ==
Balance RAID1/RAID10 mirror selection via plain round-robin scheduling. This should roughly double throughput for large reads.

Signed-off-by: Louis Jencka <louis@waffle.tech>
---
 fs/btrfs/volumes.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)