[RFC,0/5] readmirror feature

Message ID 1552989624-29577-1-git-send-email-anand.jain@oracle.com

Anand Jain March 19, 2019, 10 a.m. UTC
This is an RFC for now; your comments are appreciated. The patch set
has been tested.

The call chain __btrfs_map_block()->find_live_mirror() uses the thread
pid to determine %mirror_num when mirror_num is 0.
    
The pid-based mirror_num selection has the following disadvantages:
 . A large single-process read IO will read from only one disk.
 . In the worst case, all processes reading from the FS could have
       either odd or even pids, so the read IO gets skewed to one disk.
 . There is no deterministic way of knowing or controlling which copy
       will be used for reading.
 . A given multi-process workload may see performance variations when
       run at different times.
    
So we need other types of readmirror policies.
    
This patch set introduces a framework so that more policies can be
added, and converts the existing %pid policy into a configurable
parameter using a property. It also provides a transient readmirror
mount option, so that the policy can be applied to read IO at mount
time and on a read-only FS.

 For example:
   btrfs property set <mnt> readmirror pid
   btrfs property set <mnt> readmirror ""
   btrfs property set <mnt> readmirror devid<n>

   mount -o readmirror=pid <mnt>
   mount -o readmirror=devid<n> <mnt>

Please note:
 The property is stored as an extended attribute of the root inode
 (BTRFS_FS_TREE_OBJECTID); the other choice is a new on-disk KEY,
 which is open for comments.

 This patch set uses the currently unused btrfs_device::type, and does
 not use the bitwise ops helpers: type is u64 while those helpers take
 u32, so we would be limited to 32 bits of btrfs_device::type, which
 is kind of messy. Suggestions for a better way are welcome.

Anand Jain (5):
  btrfs: add inode pointer to prop_handler::validate()
  btrfs: add readmirror pid property
  btrfs: add readmirror devid property
  btrfs: add readmirror pid mount option
  btrfs: add readmirror devid mount option

 fs/btrfs/props.c   | 99 +++++++++++++++++++++++++++++++++++++++++++++++++++---
 fs/btrfs/super.c   | 29 ++++++++++++++++
 fs/btrfs/volumes.c | 29 +++++++++++++++-
 fs/btrfs/volumes.h | 11 ++++++
 4 files changed, 163 insertions(+), 5 deletions(-)

Anand Jain (2):
  btrfs-progs: add helper to create xattr name
  btrfs-progs: add readmirror policy

 props.c | 75 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++------
 1 file changed, 68 insertions(+), 7 deletions(-)

Comments

Steven Davies March 21, 2019, 4:26 p.m. UTC | #1
On 2019-03-19 10:00, Anand Jain wrote:
> RFC patch as of now, appreciate your comments. This patch set has
> been tested.
> 
> Function call chain  __btrfs_map_block()->find_live_mirror() uses
> thread pid to determine the %mirror_num when the mirror_num=0.
> 
> The pid based mirror_num extrapolation has following disadvantages
>  . A large single-process read IO will read only from one disk.
>  . In a worst scenario all processes read accessing the FS could have
>        either odd or even pid, the read IO gets skewed.
>  . There is no deterministic way of knowing/controlling which copy will
>        be used for reading.
>  . May see performance variations for a given set of multi process
>        workload ran at different times.
> 
> So we need other types of readmirror policies.
> 
> This patch introduces a framework so that we can add more policies, and
> converts the existing %pid into as a configurable parameter using the
> property. And also provides a transient readmirror mount option, so 
> that
> this property can be applied for the read io during mount and for
> readonly FS.

Is it possible to set this property at mkfs time?

>  For example:
>    btrfs property set <mnt> readmirror pid
>    btrfs property set <mnt> readmirror ""
>    btrfs property set <mnt> readmirror devid<n>
> 
>    mount -o readmirror=pid <mnt>
>    mount -o readmirror=devid<n> <mnt>

This is an edge case but should we be allowed to set more than one 
device as a read mirror in a 3+ device array? In theory there could be 
two fast disks and one slow disk where all stripes are guaranteed to be 
on at least one fast disk.

I'll test these patches out when I have some spare time over the next 
few weeks. Do you have a tree I can pull / what are the patches based 
on?

Way beyond this patch series, considering a 3+ device raid1 array with 
mixed fast and slow disks perhaps there could also be a write preference 
for disks to fill up the fast disks first.

Steven Davies
waxhead March 21, 2019, 6:26 p.m. UTC | #2
Steven Davies wrote:
> On 2019-03-19 10:00, Anand Jain wrote:
>> RFC patch as of now, appreciate your comments. This patch set has
>> been tested.
>>
>> ....
>>
>> This patch introduces a framework so that we can add more policies, and
>> converts the existing %pid into as a configurable parameter using the
>> property. And also provides a transient readmirror mount option, so that
>> this property can be applied for the read io during mount and for
>> readonly FS.
> 
> Is it possible to set this property at mkfs time?
> 
>>  For example:
>>    btrfs property set <mnt> readmirror pid
>>    btrfs property set <mnt> readmirror ""
>>    btrfs property set <mnt> readmirror devid<n>
>>
>>    mount -o readmirror=pid <mnt>
>>    mount -o readmirror=devid<n> <mnt>
> 
> This is an edge case but should we be allowed to set more than one 
> device as a read mirror in a 3+ device array? In theory there could be 
> two fast disks and one slow disk where all stripes are guaranteed to be 
> on at least one fast disk.
> 
> I'll test these patches out when I have some spare time over the next 
> few weeks. Do you have a tree I can pull / what are the patches based on?
> 
> Way beyond this patch series, considering a 3+ device raid1 array with 
> mixed fast and slow disks perhaps there could also be a write preference 
> for disks to fill up the fast disks first.
> 
> Steven Davies

This is more or less a feature request, but it feels right to mention 
it here...

If I remember correctly BTRFS does not currently scale to more than 32 
devices (due to some striping-related limits), but if BTRFS can have 
many more devices, and even now at >16 devices, would it not make 
sense to be able to "tag" devices or make "device groups"?

For example - when (if) subvolumes can have different "RAID" profiles 
it would perhaps make sense to ensure that certain subvolumes prefer 
to use storage from a certain group of devices. In a mixed setup where 
you have SSDs, HDDs and those new M.2 things (or whatever) it could be 
nice to make a subvolume for /usr or /var and ask BTRFS to prefer to 
store it on SSDs, while a subvolume for /home could be preferred to be 
stored on HDDs.

Having device groups could also allow defining certain storage devices 
for "hot data", so that data that is read often could auto-migrate to 
the faster storage devices. As I understand BTRFS, it is "just" a 
matter of setting a flag per chunk so it would prefer to be allocated 
on a device of type/group xyz...

In an N-way mirror setup you would read mostly from SSDs while using 
HDDs for storing the mirror copies.

If I may suggest... I think something along the lines of:

'btrfs device setgroup DEVID GROUPNAME'
'btrfs property set GROUPNAME readweight=100, writeweight=50'
'btrfs property set GROUPNAME readmirror=whatever_policy'
and
'btrfs subvolume setgroup GROUPNAME'

would do just fine...

Again, just a suggestion from a regular BTRFS user. Nothing more, but 
please consider something like this. The current readmirror idea 
sounds a tad limited as it does not account for subvolumes.
Anand Jain March 22, 2019, 6:08 a.m. UTC | #3
On 3/22/19 12:26 AM, Steven Davies wrote:
> On 2019-03-19 10:00, Anand Jain wrote:
>> RFC patch as of now, appreciate your comments. This patch set has
>> been tested.
>>
>> Function call chain  __btrfs_map_block()->find_live_mirror() uses
>> thread pid to determine the %mirror_num when the mirror_num=0.
>>
>> The pid based mirror_num extrapolation has following disadvantages
>>  . A large single-process read IO will read only from one disk.
>>  . In a worst scenario all processes read accessing the FS could have
>>        either odd or even pid, the read IO gets skewed.
>>  . There is no deterministic way of knowing/controlling which copy will
>>        be used for reading.
>>  . May see performance variations for a given set of multi process
>>        workload ran at different times.
>>
>> So we need other types of readmirror policies.
>>
>> This patch introduces a framework so that we can add more policies, and
>> converts the existing %pid into as a configurable parameter using the
>> property. And also provides a transient readmirror mount option, so that
>> this property can be applied for the read io during mount and for
>> readonly FS.
> 
> Is it possible to set this property at mkfs time?

  Possible.

>>  For example:
>>    btrfs property set <mnt> readmirror pid
>>    btrfs property set <mnt> readmirror ""
>>    btrfs property set <mnt> readmirror devid<n>
>>
>>    mount -o readmirror=pid <mnt>
>>    mount -o readmirror=devid<n> <mnt>
> 
> This is an edge case but should we be allowed to set more than one 
> device as a read mirror in a 3+ device array? In theory there could be 
> two fast disks and one slow disk where all stripes are guaranteed to be 
> on at least one fast disk.

  Right. We need to evolve the device grouping part to get this.

> I'll test these patches out when I have some spare time over the next 
> few weeks. Do you have a tree I can pull / what are the patches based on?

   Thanks for offering to test. Sure, I will upload the tree and send the link.

> Way beyond this patch series, considering a 3+ device raid1 array with 
> mixed fast and slow disks perhaps there could also be a write preference 
> for disks to fill up the fast disks first.

   Right.

Thanks
Anand

> Steven Davies
David Sterba April 9, 2019, 3:48 p.m. UTC | #4
On Tue, Mar 19, 2019 at 06:00:19PM +0800, Anand Jain wrote:
> RFC patch as of now, appreciate your comments. This patch set has
> been tested.
> 
> Function call chain  __btrfs_map_block()->find_live_mirror() uses
> thread pid to determine the %mirror_num when the mirror_num=0.
>     
> The pid based mirror_num extrapolation has following disadvantages
>  . A large single-process read IO will read only from one disk.

Please note that the pid refers to the metadata kernel threads that
submit the IO, so on average the IO is serviced by different threads
with random PID. In the end it gets distributed over multiple drives.
But it's not ideal, no argument about that.

>  . In a worst scenario all processes read accessing the FS could have
>        either odd or even pid, the read IO gets skewed.
>  . There is no deterministic way of knowing/controlling which copy will
>        be used for reading.
>  . May see performance variations for a given set of multi process
>        workload ran at different times.
>     
> So we need other types of readmirror policies.
>     
> This patch introduces a framework so that we can add more policies, and
> converts the existing %pid into as a configurable parameter using the
> property. And also provides a transient readmirror mount option, so that
> this property can be applied for the read io during mount and for
> readonly FS.

I think the mirror selection by pid was a temporary solution (that
stuck, as it always happens). On average it does not work badly. Once
we have a better way, it can be dropped. So I'm not convinced it needs
to be one of the configuration options.

What exactly do you mean by 'transient'? I see that it could be needed
for mount or read-only, though I don't see the usefulness of that.

>  For example:
>    btrfs property set <mnt> readmirror pid
>    btrfs property set <mnt> readmirror ""
>    btrfs property set <mnt> readmirror devid<n>
> 
>    mount -o readmirror=pid <mnt>
>    mount -o readmirror=devid<n> <mnt>
> 
> Please note:
>  The property storage is an extented attributes of root inode
>  (BTRFS_FS_TREE_OBJECTID), the other choice is a new ondisk KEY,
>  which is open for comments.

A new key type is not going to be allocated for that, the extended
attributes for properties are there. Another question is where to attach
the property as this is a whole-filesystem property, we don't have an
example to follow so far.

>  This patch uses the unused btrfs_device::type, and does not use the
>  bitwise ops because as type is u64 and bitwise ops need u32, so we
>  need 32bits of the btrfs_device::type. Which is a kind of messy.
>  Its open for suggestion for any better way.

For a prototype and early testing, using the type is probably ok, but
I don't think it's right for permanent storage. Besides I'm sure that
the u64 will not suffice in the long run.
Anand Jain April 12, 2019, 9:02 a.m. UTC | #5
Thanks for the comments.. more below.

On 9/4/19 11:48 PM, David Sterba wrote:
> On Tue, Mar 19, 2019 at 06:00:19PM +0800, Anand Jain wrote:
>> RFC patch as of now, appreciate your comments. This patch set has
>> been tested.
>>
>> Function call chain  __btrfs_map_block()->find_live_mirror() uses
>> thread pid to determine the %mirror_num when the mirror_num=0.
>>      
>> The pid based mirror_num extrapolation has following disadvantages
>>   . A large single-process read IO will read only from one disk.
> 
> Please note that the pid refers to the metadata kernel threads that
> submit the IO, so on average the IO is serviced by different threads
> with random PID. In the end it gets distributed over multiple drives.
> But it's nott ideal, no argument about that.

  ok.

>>   . In a worst scenario all processes read accessing the FS could have
>>         either odd or even pid, the read IO gets skewed.
>>   . There is no deterministic way of knowing/controlling which copy will
>>         be used for reading.
>>   . May see performance variations for a given set of multi process
>>         workload ran at different times.
>>      
>> So we need other types of readmirror policies.
>>      
>> This patch introduces a framework so that we can add more policies, and
>> converts the existing %pid into as a configurable parameter using the
>> property. And also provides a transient readmirror mount option, so that
>> this property can be applied for the read io during mount and for
>> readonly FS.
> 
> I think the mirror selection by pid was a temporary solution (that
> stuck, as it always happens). On average it does not work bad. Once we
> have a better way then it can be dropped. So I'm not convinced it needs
> to be one of the configuration options.

  Ok. Thanks for confirming; we could replace pid with something better.
  As of now we have devid-based and device-qdepth-based policies; IMO
  the qdepth-based one should be the default, as I commented in the
  respective patch.

> What exactly do you mean by 'transient'? I see that it could be needed
> for mount or read-only, though I don't see the usefulness of that.

  I mean it does not persist across reboots. It can momentarily
  override the original setting, let's say the device-qdepth-based one,
  for example to avoid known read errors from a disk.

>>   For example:
>>     btrfs property set <mnt> readmirror pid
>>     btrfs property set <mnt> readmirror ""
>>     btrfs property set <mnt> readmirror devid<n>
>>
>>     mount -o readmirror=pid <mnt>
>>     mount -o readmirror=devid<n> <mnt>
>>
>> Please note:
>>   The property storage is an extented attributes of root inode
>>   (BTRFS_FS_TREE_OBJECTID), the other choice is a new ondisk KEY,
>>   which is open for comments.
> 
> A new key type is not going to be allocated for that, the extended
> attributes for properties are there. Another question is where to attach
> the property as this is a whole-filesystem property, we don't have an
> example to follow so far.

  Ok, no key.
  Yep, it's a whole-filesystem property. As of now it goes to the root
  inode as an extended attribute. I find that reasonable.

>>   This patch uses the unused btrfs_device::type, and does not use the
>>   bitwise ops because as type is u64 and bitwise ops need u32, so we
>>   need 32bits of the btrfs_device::type. Which is a kind of messy.
>>   Its open for suggestion for any better way.
> 
> For a prototype and early testing, using the type is probably ok, but
> I don't think it's right for permanent storage. Besides I'm sure that
> the u64 will not suffice in the long run.

Hmm.. I don't know. I thought
   <upper 32 bits> for disk groups
   <lower 32 bits> for disk types
should suffice. Not too sure what I am missing.

Thanks, Anand.