| Message ID | 1552989624-29577-1-git-send-email-anand.jain@oracle.com |
|---|---|
On 2019-03-19 10:00, Anand Jain wrote:
> RFC patch as of now, appreciate your comments. This patch set has
> been tested.
>
> Function call chain __btrfs_map_block()->find_live_mirror() uses
> thread pid to determine the %mirror_num when the mirror_num=0.
>
> The pid based mirror_num extrapolation has the following disadvantages
> . A large single-process read IO will read only from one disk.
> . In a worst-case scenario all processes accessing the FS for reads
>   could have either odd or even pid, so the read IO gets skewed.
> . There is no deterministic way of knowing/controlling which copy will
>   be used for reading.
> . May see performance variations for a given multi-process workload
>   run at different times.
>
> So we need other types of readmirror policies.
>
> This patch introduces a framework so that we can add more policies, and
> converts the existing %pid into a configurable parameter using the
> property. It also provides a transient readmirror mount option, so that
> this property can be applied for the read IO during mount and for a
> readonly FS.

Is it possible to set this property at mkfs time?

> For example:
> btrfs property set <mnt> readmirror pid
> btrfs property set <mnt> readmirror ""
> btrfs property set <mnt> readmirror devid<n>
>
> mount -o readmirror=pid <mnt>
> mount -o readmirror=devid<n> <mnt>

This is an edge case but should we be allowed to set more than one
device as a read mirror in a 3+ device array? In theory there could be
two fast disks and one slow disk where all stripes are guaranteed to be
on at least one fast disk.

I'll test these patches out when I have some spare time over the next
few weeks. Do you have a tree I can pull / what are the patches based on?

Way beyond this patch series, considering a 3+ device raid1 array with
mixed fast and slow disks, perhaps there could also be a write
preference for disks, to fill up the fast disks first.

Steven Davies
Steven Davies wrote:
> On 2019-03-19 10:00, Anand Jain wrote:
>> RFC patch as of now, appreciate your comments. This patch set has
>> been tested.
>>
>> ....
>>
>> This patch introduces a framework so that we can add more policies, and
>> converts the existing %pid into a configurable parameter using the
>> property. It also provides a transient readmirror mount option, so that
>> this property can be applied for the read IO during mount and for a
>> readonly FS.
>
> Is it possible to set this property at mkfs time?
>
>> For example:
>> btrfs property set <mnt> readmirror pid
>> btrfs property set <mnt> readmirror ""
>> btrfs property set <mnt> readmirror devid<n>
>>
>> mount -o readmirror=pid <mnt>
>> mount -o readmirror=devid<n> <mnt>
>
> This is an edge case but should we be allowed to set more than one
> device as a read mirror in a 3+ device array? In theory there could be
> two fast disks and one slow disk where all stripes are guaranteed to be
> on at least one fast disk.
>
> I'll test these patches out when I have some spare time over the next
> few weeks. Do you have a tree I can pull / what are the patches based on?
>
> Way beyond this patch series, considering a 3+ device raid1 array with
> mixed fast and slow disks, perhaps there could also be a write
> preference for disks, to fill up the fast disks first.
>
> Steven Davies

This is more or less a feature request, but it feels right to mention it
here... If I remember correctly BTRFS does not currently scale to more
than 32 devices (due to some striping related stuff), but if BTRFS can
have many more devices, and even now at >16 devices, would it not make
sense to be able to "tag" devices or make "device groups"?

For example - when (if) subvolumes can have different "RAID" profiles it
would perhaps make sense to ensure that certain subvolumes would prefer
to use storage from a certain group of devices.

In a mixed setup where you have SSDs, HDDs and those new M.2 things (or
whatever) it could be nice to make a subvolume for /usr or /var and ask
BTRFS to prefer to store that on SSDs, while a subvolume for /home could
be preferred to be stored on HDDs. Having device groups could also allow
defining certain storage devices for "hot data", so that data that is
read often could auto-migrate to the faster storage devices. As I
understand BTRFS it is "just" a matter of setting a flag per chunk so it
would prefer to be allocated on devices of type/group xyz... In an N-way
mirror setup you would read mostly from SSDs while using HDDs for
storing the mirror copy.

If I may suggest... I think something along the lines of:

'btrfs device setgroup DEVID GROUPNAME'
'btrfs property set GROUPNAME readweight=100, writeweight=50'
'btrfs property set GROUPNAME readmirror=whatever_policy'

and

'btrfs subvolume setgroup GROUPNAME'

would do just fine... Again, just a suggestion from a regular BTRFS
user. Nothing more, but please consider something like this. The current
readmirror idea sounds a tad limited as it does not account for
subvolumes.
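The readweight idea suggested above could be sketched roughly as below. This is purely hypothetical - neither the weights nor the selection function exist in btrfs; the names and the proportional-pick scheme are illustrative assumptions only.

```c
#include <assert.h>

/*
 * Hypothetical sketch: pick one of n mirrors with probability
 * proportional to its configured read weight (e.g. {100, 50} means
 * the first mirror serves ~2/3 of reads).  'rnd' stands in for any
 * randomness source the kernel might use.
 */
static int pick_weighted_mirror(const int *weights, int n, unsigned int rnd)
{
	int total = 0, i;

	for (i = 0; i < n; i++)
		total += weights[i];
	rnd %= total;	/* map rnd onto the total weight range */
	for (i = 0; i < n; i++) {
		if (rnd < (unsigned int)weights[i])
			return i;
		rnd -= weights[i];
	}
	return n - 1;	/* unreachable if weights are positive */
}
```

With weights {100, 50}, values of rnd in [0, 100) select mirror 0 and [100, 150) select mirror 1, so the fast disk takes two reads out of three.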
On 3/22/19 12:26 AM, Steven Davies wrote:
> On 2019-03-19 10:00, Anand Jain wrote:
>> RFC patch as of now, appreciate your comments. This patch set has
>> been tested.
>>
>> Function call chain __btrfs_map_block()->find_live_mirror() uses
>> thread pid to determine the %mirror_num when the mirror_num=0.
>>
>> The pid based mirror_num extrapolation has the following disadvantages
>> . A large single-process read IO will read only from one disk.
>> . In a worst-case scenario all processes accessing the FS for reads
>>   could have either odd or even pid, so the read IO gets skewed.
>> . There is no deterministic way of knowing/controlling which copy will
>>   be used for reading.
>> . May see performance variations for a given multi-process workload
>>   run at different times.
>>
>> So we need other types of readmirror policies.
>>
>> This patch introduces a framework so that we can add more policies, and
>> converts the existing %pid into a configurable parameter using the
>> property. It also provides a transient readmirror mount option, so that
>> this property can be applied for the read IO during mount and for a
>> readonly FS.
>
> Is it possible to set this property at mkfs time?

Possible.

>> For example:
>> btrfs property set <mnt> readmirror pid
>> btrfs property set <mnt> readmirror ""
>> btrfs property set <mnt> readmirror devid<n>
>>
>> mount -o readmirror=pid <mnt>
>> mount -o readmirror=devid<n> <mnt>
>
> This is an edge case but should we be allowed to set more than one
> device as a read mirror in a 3+ device array? In theory there could be
> two fast disks and one slow disk where all stripes are guaranteed to be
> on at least one fast disk.

Right. We need to evolve the device grouping part to get this.

> I'll test these patches out when I have some spare time over the next
> few weeks. Do you have a tree I can pull / what are the patches based on?

Thanks for the tests. Sure, will upload and send the link.

> Way beyond this patch series, considering a 3+ device raid1 array with
> mixed fast and slow disks, perhaps there could also be a write
> preference for disks, to fill up the fast disks first.

Right.

Thanks, Anand

> Steven Davies
On Tue, Mar 19, 2019 at 06:00:19PM +0800, Anand Jain wrote:
> RFC patch as of now, appreciate your comments. This patch set has
> been tested.
>
> Function call chain __btrfs_map_block()->find_live_mirror() uses
> thread pid to determine the %mirror_num when the mirror_num=0.
>
> The pid based mirror_num extrapolation has the following disadvantages
> . A large single-process read IO will read only from one disk.

Please note that the pid refers to the metadata kernel threads that
submit the IO, so on average the IO is serviced by different threads
with random PID. In the end it gets distributed over multiple drives.
But it's not ideal, no argument about that.

> . In a worst-case scenario all processes accessing the FS for reads
>   could have either odd or even pid, so the read IO gets skewed.
> . There is no deterministic way of knowing/controlling which copy will
>   be used for reading.
> . May see performance variations for a given multi-process workload
>   run at different times.
>
> So we need other types of readmirror policies.
>
> This patch introduces a framework so that we can add more policies, and
> converts the existing %pid into a configurable parameter using the
> property. It also provides a transient readmirror mount option, so that
> this property can be applied for the read IO during mount and for a
> readonly FS.

I think the mirror selection by pid was a temporary solution (that
stuck, as it always happens). On average it does not work badly. Once we
have a better way then it can be dropped. So I'm not convinced it needs
to be one of the configuration options.

What exactly do you mean by 'transient'? I see that it could be needed
for mount or read-only, though I don't see the usefulness of that.

> For example:
> btrfs property set <mnt> readmirror pid
> btrfs property set <mnt> readmirror ""
> btrfs property set <mnt> readmirror devid<n>
>
> mount -o readmirror=pid <mnt>
> mount -o readmirror=devid<n> <mnt>
>
> Please note:
> The property storage is an extended attribute of the root inode
> (BTRFS_FS_TREE_OBJECTID); the other choice is a new ondisk KEY,
> which is open for comments.

A new key type is not going to be allocated for that, the extended
attributes for properties are there. Another question is where to attach
the property, as this is a whole-filesystem property; we don't have an
example to follow so far.

> This patch uses the unused btrfs_device::type, and does not use
> bitwise ops, because the type is u64 and bitwise ops need u32, so we
> would need 32 bits of btrfs_device::type. Which is kind of messy.
> It's open for suggestion for any better way.

For a prototype and early testing, using the type is probably ok, but
I don't think it's right for permanent storage. Besides, I'm sure that
the u64 will not suffice in the long run.
Thanks for the comments... more below.

On 9/4/19 11:48 PM, David Sterba wrote:
> On Tue, Mar 19, 2019 at 06:00:19PM +0800, Anand Jain wrote:
>> RFC patch as of now, appreciate your comments. This patch set has
>> been tested.
>>
>> Function call chain __btrfs_map_block()->find_live_mirror() uses
>> thread pid to determine the %mirror_num when the mirror_num=0.
>>
>> The pid based mirror_num extrapolation has the following disadvantages
>> . A large single-process read IO will read only from one disk.
>
> Please note that the pid refers to the metadata kernel threads that
> submit the IO, so on average the IO is serviced by different threads
> with random PID. In the end it gets distributed over multiple drives.
> But it's not ideal, no argument about that.

ok.

>> . In a worst-case scenario all processes accessing the FS for reads
>>   could have either odd or even pid, so the read IO gets skewed.
>> . There is no deterministic way of knowing/controlling which copy will
>>   be used for reading.
>> . May see performance variations for a given multi-process workload
>>   run at different times.
>>
>> So we need other types of readmirror policies.
>>
>> This patch introduces a framework so that we can add more policies, and
>> converts the existing %pid into a configurable parameter using the
>> property. It also provides a transient readmirror mount option, so that
>> this property can be applied for the read IO during mount and for a
>> readonly FS.
>
> I think the mirror selection by pid was a temporary solution (that
> stuck, as it always happens). On average it does not work badly. Once we
> have a better way then it can be dropped. So I'm not convinced it needs
> to be one of the configuration options.

Ok. Thanks for confirming, we could replace pid with something better;
as of now we have devid-based and device-qdepth-based. IMO the qdepth
based policy should be the default, as I commented in the respective
patch.

> What exactly do you mean by 'transient'? I see that it could be needed
> for mount or read-only, though I don't see the usefulness of that.

I mean to say it does not persist across reboot. It is to override the
original setting (let's say device-qdepth based) momentarily, for
example to avoid known read errors from a disk.

>> For example:
>> btrfs property set <mnt> readmirror pid
>> btrfs property set <mnt> readmirror ""
>> btrfs property set <mnt> readmirror devid<n>
>>
>> mount -o readmirror=pid <mnt>
>> mount -o readmirror=devid<n> <mnt>
>>
>> Please note:
>> The property storage is an extended attribute of the root inode
>> (BTRFS_FS_TREE_OBJECTID); the other choice is a new ondisk KEY,
>> which is open for comments.
>
> A new key type is not going to be allocated for that, the extended
> attributes for properties are there. Another question is where to attach
> the property, as this is a whole-filesystem property; we don't have an
> example to follow so far.

Ok, no key. Yep, it's a whole-filesystem property. As of now it goes to
the root inode as an extended attribute. I find it reasonable.

>> This patch uses the unused btrfs_device::type, and does not use
>> bitwise ops, because the type is u64 and bitwise ops need u32, so we
>> would need 32 bits of btrfs_device::type. Which is kind of messy.
>> It's open for suggestion for any better way.
>
> For a prototype and early testing, using the type is probably ok, but
> I don't think it's right for permanent storage. Besides, I'm sure that
> the u64 will not suffice in the long run.

Hmm.. I don't know. I thought <upper 32 bits> for disk groups and
<lower 32 bits> for disk types should suffice. Not too sure what I am
missing.

Thanks, Anand.
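The upper/lower split Anand describes could look like the following. This is a sketch of the proposed layout only - the macro and helper names are made up for illustration, and nothing here is the actual btrfs on-disk format.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical encoding of btrfs_device::type as proposed in the
 * thread: upper 32 bits for the disk group, lower 32 bits for the
 * disk type.
 */
#define DEV_TYPE_MASK	0x00000000ffffffffULL
#define DEV_GROUP_SHIFT	32

static inline uint64_t dev_make_type(uint32_t group, uint32_t type)
{
	return ((uint64_t)group << DEV_GROUP_SHIFT) | type;
}

static inline uint32_t dev_group(uint64_t t)
{
	return (uint32_t)(t >> DEV_GROUP_SHIFT);	/* upper 32 bits */
}

static inline uint32_t dev_type(uint64_t t)
{
	return (uint32_t)(t & DEV_TYPE_MASK);		/* lower 32 bits */
}
```

David's objection still applies: once both halves carry meaning, no spare bits remain in the u64 for anything else.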
RFC patch as of now, appreciate your comments. This patch set has
been tested.

Function call chain __btrfs_map_block()->find_live_mirror() uses
thread pid to determine the %mirror_num when the mirror_num=0.

The pid based mirror_num extrapolation has the following disadvantages
. A large single-process read IO will read only from one disk.
. In a worst-case scenario all processes accessing the FS for reads
  could have either odd or even pid, so the read IO gets skewed.
. There is no deterministic way of knowing/controlling which copy will
  be used for reading.
. May see performance variations for a given multi-process workload
  run at different times.

So we need other types of readmirror policies.

This patch introduces a framework so that we can add more policies, and
converts the existing %pid into a configurable parameter using the
property. It also provides a transient readmirror mount option, so that
this property can be applied for the read IO during mount and for a
readonly FS.

For example:
btrfs property set <mnt> readmirror pid
btrfs property set <mnt> readmirror ""
btrfs property set <mnt> readmirror devid<n>

mount -o readmirror=pid <mnt>
mount -o readmirror=devid<n> <mnt>

Please note:
The property storage is an extended attribute of the root inode
(BTRFS_FS_TREE_OBJECTID); the other choice is a new ondisk KEY,
which is open for comments.

This patch uses the unused btrfs_device::type, and does not use
bitwise ops, because the type is u64 and bitwise ops need u32, so we
would need 32 bits of btrfs_device::type. Which is kind of messy.
It's open for suggestion for any better way.
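For reference, the pid heuristic criticized above can be sketched roughly as follows. This is a simplified illustration of the idea (the function name is made up), not the exact kernel code in find_live_mirror().

```c
#include <assert.h>

/*
 * Simplified sketch of the existing heuristic: when mirror_num == 0,
 * the preferred copy is derived from the submitting thread's pid.
 * For RAID1 (num_copies == 2), every even-pid reader hits copy 0 and
 * every odd-pid reader hits copy 1 -- hence the skew described above
 * when a single large reader, or readers with same-parity pids,
 * dominate the workload.
 */
static int pick_mirror_by_pid(int pid, int num_copies)
{
	return pid % num_copies;
}
```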
Anand Jain (5):
  btrfs: add inode pointer to prop_handler::validate()
  btrfs: add readmirror pid property
  btrfs: add readmirror devid property
  btrfs: add readmirror pid mount option
  btrfs: add readmirror devid mount option

 fs/btrfs/props.c   | 99 +++++++++++++++++++++++++++++++++++++++++++++++++++--
 fs/btrfs/super.c   | 29 ++++++++++++++++
 fs/btrfs/volumes.c | 29 +++++++++++++++-
 fs/btrfs/volumes.h | 11 ++++++
 4 files changed, 163 insertions(+), 5 deletions(-)

Anand Jain (2):
  btrfs-progs: add helper to create xattr name
  btrfs-progs: add readmirror policy

 props.c | 75 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++------
 1 file changed, 68 insertions(+), 7 deletions(-)