Message ID | 20190826090438.7044-1-anand.jain@oracle.com (mailing list archive) |
---|---|
Headers | show |
Series | readmirror feature | expand |
On Mon, Aug 26, 2019 at 05:04:36PM +0800, Anand Jain wrote: > Function call chain __btrfs_map_block()->find_live_mirror() uses > thread pid to determine the %mirror_num when the mirror_num=0. > > This patch introduces a framework so that we can add policies to determine > the %mirror_num. And also adds the devid as the readmirror policy. > > The new property is stored as an item in the device tree as show below. > (BTRFS_READMIRROR_OBJECTID, BTRFS_PERSISTENT_ITEM_KEY, devid) > > To be able to set and get this new property also introduces new ioctls > BTRFS_IOC_GET_READMIRROR and BTRFS_IOC_SET_READMIRROR. The ioctl argument > is defined as > struct btrfs_ioctl_readmirror_args { > __u64 type; /* RW */ > __u64 device_bitmap; /* RW */ > } I don't remember if there was a suggestion to use ioctls for read mirror, but the property interface should be sufficient. Besides this ioctl interafce is quite an anti-pattern: narrow use, non-extensible structure, alternative and more convenient interface already available. > An usage example as follows: > btrfs property set /btrfs readmirror devid:1,3 > btrfs property get /btrfs readmirror > readmirror devid:1 3 > btrfs property set /btrfs readmirror "" > btrfs property get /btrfs readmirror > readmirror default > > This patchset has been tested completely, however marked as RFC for the > following reasons and comments on them (or any other) are appreciated as > usual. > . The new objectid is defined as > #define BTRFS_READMIRROR_OBJECTID -1ULL > Need consent we are fine to use this value, and with this value it > shall be placed just before the DEV_STATS_OBJECTID item which is more > frequently used only during the device errors. -1 can be considered a special value in other cases, not necessarily here but if the ordering of items is the only reason I'd say no. The keys/items will most likely land in the same node so there's no point forcing the order. > . I am using a u64 bitmap to represent the devices id, so the max device > id that we could represent is 63, its a kind of limitation which should > be addressed before integration, I wonder if there is any suggestion? Uh 63, so now an implementation detail is posing a global limit? That sounds backwards. > Kindly note that, multiple ioctls with each time representing a set of > device(s) is not a choice because we need to make sure the readmirror > changes happens in a commit transaction. I believe this can be guaranteed by the properties interface, ie. single value gets processed at once and with some locking around the state of devices can be updated atomically. The open question is still how to store the readmirror property per-device, it could be either an item or bit inside the device structure. Besides the interface, I'm kind of missing the usecase description what is expected from the read mirror policies and how they could affect various scenarios. Maybe it was in some of the previous iterations, it's hard too track everything so this should be part of the cover letter (or at leat linked if it's too much text).
On Mon, Aug 26, 2019 at 05:04:36PM +0800, Anand Jain wrote: > Function call chain __btrfs_map_block()->find_live_mirror() uses > thread pid to determine the %mirror_num when the mirror_num=0. > > This patch introduces a framework so that we can add policies to determine > the %mirror_num. And also adds the devid as the readmirror policy. > > The new property is stored as an item in the device tree as show below. > (BTRFS_READMIRROR_OBJECTID, BTRFS_PERSISTENT_ITEM_KEY, devid) > > To be able to set and get this new property also introduces new ioctls > BTRFS_IOC_GET_READMIRROR and BTRFS_IOC_SET_READMIRROR. The ioctl argument > is defined as > struct btrfs_ioctl_readmirror_args { > __u64 type; /* RW */ > __u64 device_bitmap; /* RW */ > } > > An usage example as follows: > btrfs property set /btrfs readmirror devid:1,3 > btrfs property get /btrfs readmirror > readmirror devid:1 3 > btrfs property set /btrfs readmirror "" > btrfs property get /btrfs readmirror > readmirror default > > This patchset has been tested completely, however marked as RFC for the > following reasons and comments on them (or any other) are appreciated as > usual. > . The new objectid is defined as > #define BTRFS_READMIRROR_OBJECTID -1ULL > Need consent we are fine to use this value, and with this value it > shall be placed just before the DEV_STATS_OBJECTID item which is more > frequently used only during the device errors. > > . I am using a u64 bitmap to represent the devices id, so the max device > id that we could represent is 63, its a kind of limitation which should > be addressed before integration, I wonder if there is any suggestion? > Kindly note that, multiple ioctls with each time representing a set of > device(s) is not a choice because we need to make sure the readmirror > changes happens in a commit transaction. > > v1->RFC v2: > . Property is stored as a dev-tree item instead of root inode extended > attribute. > . Rename BTRFS_DEV_STATE_READ_OPRIMIZED to BTRFS_DEV_STATE_READ_PREFERRED. > . Changed format specifier from devid1,2,3.. to devid:1,2,3.. > > RFC->v1: > Drops pid as one of the readmirror policy choices and as usual remains > as default. And when the devid is reset the readmirror policy falls back > to pid. > Drops the mount -o readmirror idea, it can be added at a later point of > time. > Property now accepts more than 1 devid as readmirror device. As shown > in the example above. > This is a lot of infrastructure to just change which mirror we read to based on some arbitrary user policy. I assume this is to solve the case where you have slow and fast disks, so you can always read from the fast disk? And then it's only used in RAID1, so the very narrow usecase of having a RAID1 setup with a SSD and a normal disk? I'm not seeing a point to this much code for one particular obscure setup. Thanks, Josef
On Wed, Sep 11, 2019 at 2:46 PM Josef Bacik <josef@toxicpanda.com> wrote: > > On Mon, Aug 26, 2019 at 05:04:36PM +0800, Anand Jain wrote: > > Function call chain __btrfs_map_block()->find_live_mirror() uses > > thread pid to determine the %mirror_num when the mirror_num=0. > > > > This patch introduces a framework so that we can add policies to determine > > the %mirror_num. And also adds the devid as the readmirror policy. > > > > The new property is stored as an item in the device tree as show below. > > (BTRFS_READMIRROR_OBJECTID, BTRFS_PERSISTENT_ITEM_KEY, devid) > > > > To be able to set and get this new property also introduces new ioctls > > BTRFS_IOC_GET_READMIRROR and BTRFS_IOC_SET_READMIRROR. The ioctl argument > > is defined as > > struct btrfs_ioctl_readmirror_args { > > __u64 type; /* RW */ > > __u64 device_bitmap; /* RW */ > > } > > > > An usage example as follows: > > btrfs property set /btrfs readmirror devid:1,3 > > btrfs property get /btrfs readmirror > > readmirror devid:1 3 > > btrfs property set /btrfs readmirror "" > > btrfs property get /btrfs readmirror > > readmirror default > > > > This patchset has been tested completely, however marked as RFC for the > > following reasons and comments on them (or any other) are appreciated as > > usual. > > . The new objectid is defined as > > #define BTRFS_READMIRROR_OBJECTID -1ULL > > Need consent we are fine to use this value, and with this value it > > shall be placed just before the DEV_STATS_OBJECTID item which is more > > frequently used only during the device errors. > > > > . I am using a u64 bitmap to represent the devices id, so the max device > > id that we could represent is 63, its a kind of limitation which should > > be addressed before integration, I wonder if there is any suggestion? > > Kindly note that, multiple ioctls with each time representing a set of > > device(s) is not a choice because we need to make sure the readmirror > > changes happens in a commit transaction. > > > > v1->RFC v2: > > . Property is stored as a dev-tree item instead of root inode extended > > attribute. > > . Rename BTRFS_DEV_STATE_READ_OPRIMIZED to BTRFS_DEV_STATE_READ_PREFERRED. > > . Changed format specifier from devid1,2,3.. to devid:1,2,3.. > > > > RFC->v1: > > Drops pid as one of the readmirror policy choices and as usual remains > > as default. And when the devid is reset the readmirror policy falls back > > to pid. > > Drops the mount -o readmirror idea, it can be added at a later point of > > time. > > Property now accepts more than 1 devid as readmirror device. As shown > > in the example above. > > > > This is a lot of infrastructure to just change which mirror we read to based on > some arbitrary user policy. I assume this is to solve the case where you have > slow and fast disks, so you can always read from the fast disk? And then it's > only used in RAID1, so the very narrow usecase of having a RAID1 setup with a > SSD and a normal disk? I'm not seeing a point to this much code for one > particular obscure setup. Thanks, > > Josef Not commenting on the code itself, but as a user I see this SSD RAID1 acceleration as a future much have feature. It's only obscure at the moment because we don't have code to take advantage of it. But on large btrfs filesystems with hundreds of GB of metadata, like I have for backups, the usability of the filesystem is dramatically improved having the metadata on an SSD( though currently only half of the time due to the even/odd pid distribution.)
On Wed, Sep 11, 2019 at 03:13:21PM -0400, Eli V wrote: > On Wed, Sep 11, 2019 at 2:46 PM Josef Bacik <josef@toxicpanda.com> wrote: > > > > On Mon, Aug 26, 2019 at 05:04:36PM +0800, Anand Jain wrote: > > > Function call chain __btrfs_map_block()->find_live_mirror() uses > > > thread pid to determine the %mirror_num when the mirror_num=0. > > > > > > This patch introduces a framework so that we can add policies to determine > > > the %mirror_num. And also adds the devid as the readmirror policy. > > > > > > The new property is stored as an item in the device tree as show below. > > > (BTRFS_READMIRROR_OBJECTID, BTRFS_PERSISTENT_ITEM_KEY, devid) > > > > > > To be able to set and get this new property also introduces new ioctls > > > BTRFS_IOC_GET_READMIRROR and BTRFS_IOC_SET_READMIRROR. The ioctl argument > > > is defined as > > > struct btrfs_ioctl_readmirror_args { > > > __u64 type; /* RW */ > > > __u64 device_bitmap; /* RW */ > > > } > > > > > > An usage example as follows: > > > btrfs property set /btrfs readmirror devid:1,3 > > > btrfs property get /btrfs readmirror > > > readmirror devid:1 3 > > > btrfs property set /btrfs readmirror "" > > > btrfs property get /btrfs readmirror > > > readmirror default > > > > > > This patchset has been tested completely, however marked as RFC for the > > > following reasons and comments on them (or any other) are appreciated as > > > usual. > > > . The new objectid is defined as > > > #define BTRFS_READMIRROR_OBJECTID -1ULL > > > Need consent we are fine to use this value, and with this value it > > > shall be placed just before the DEV_STATS_OBJECTID item which is more > > > frequently used only during the device errors. > > > > > > . I am using a u64 bitmap to represent the devices id, so the max device > > > id that we could represent is 63, its a kind of limitation which should > > > be addressed before integration, I wonder if there is any suggestion? > > > Kindly note that, multiple ioctls with each time representing a set of > > > device(s) is not a choice because we need to make sure the readmirror > > > changes happens in a commit transaction. > > > > > > v1->RFC v2: > > > . Property is stored as a dev-tree item instead of root inode extended > > > attribute. > > > . Rename BTRFS_DEV_STATE_READ_OPRIMIZED to BTRFS_DEV_STATE_READ_PREFERRED. > > > . Changed format specifier from devid1,2,3.. to devid:1,2,3.. > > > > > > RFC->v1: > > > Drops pid as one of the readmirror policy choices and as usual remains > > > as default. And when the devid is reset the readmirror policy falls back > > > to pid. > > > Drops the mount -o readmirror idea, it can be added at a later point of > > > time. > > > Property now accepts more than 1 devid as readmirror device. As shown > > > in the example above. > > > > > > > This is a lot of infrastructure to just change which mirror we read to based on > > some arbitrary user policy. I assume this is to solve the case where you have > > slow and fast disks, so you can always read from the fast disk? And then it's > > only used in RAID1, so the very narrow usecase of having a RAID1 setup with a > > SSD and a normal disk? I'm not seeing a point to this much code for one > > particular obscure setup. Thanks, > > > > Josef > > Not commenting on the code itself, but as a user I see this SSD RAID1 > acceleration as a future much have feature. It's only obscure at the > moment because we don't have code to take advantage of it. But on > large btrfs filesystems with hundreds of GB of metadata, like I have > for backups, the usability of the filesystem is dramatically improved > having the metadata on an SSD( though currently only half of the time > due to the even/odd pid distribution.) But that's different from a mirror. 100% it would be nice to say "put my metadata on the ssd, data elsewhere". That's not what this patch is about, this patch is specifically about changing which drive we choose in a mirrored setup, which is super unlikely to mirror a SSD with a slow drive, cause it's just going to be slow no matter what. Sure we could make it so reads always go to the SSD, but we can accomplish that by just adding a check for nonrotational in the code, and then we don't have to encode all this nonsense in the file system. Thanks, Josef
Thanks for the comments. More below. On 12/9/19 3:16 AM, Josef Bacik wrote: > On Wed, Sep 11, 2019 at 03:13:21PM -0400, Eli V wrote: >> On Wed, Sep 11, 2019 at 2:46 PM Josef Bacik <josef@toxicpanda.com> wrote: >>> >>> On Mon, Aug 26, 2019 at 05:04:36PM +0800, Anand Jain wrote: >>>> Function call chain __btrfs_map_block()->find_live_mirror() uses >>>> thread pid to determine the %mirror_num when the mirror_num=0. >>>> >>>> This patch introduces a framework so that we can add policies to determine >>>> the %mirror_num. And also adds the devid as the readmirror policy. >>>> >>>> The new property is stored as an item in the device tree as show below. >>>> (BTRFS_READMIRROR_OBJECTID, BTRFS_PERSISTENT_ITEM_KEY, devid) >>>> >>>> To be able to set and get this new property also introduces new ioctls >>>> BTRFS_IOC_GET_READMIRROR and BTRFS_IOC_SET_READMIRROR. The ioctl argument >>>> is defined as >>>> struct btrfs_ioctl_readmirror_args { >>>> __u64 type; /* RW */ >>>> __u64 device_bitmap; /* RW */ >>>> } >>>> >>>> An usage example as follows: >>>> btrfs property set /btrfs readmirror devid:1,3 >>>> btrfs property get /btrfs readmirror >>>> readmirror devid:1 3 >>>> btrfs property set /btrfs readmirror "" >>>> btrfs property get /btrfs readmirror >>>> readmirror default >>>> >>>> This patchset has been tested completely, however marked as RFC for the >>>> following reasons and comments on them (or any other) are appreciated as >>>> usual. >>>> . The new objectid is defined as >>>> #define BTRFS_READMIRROR_OBJECTID -1ULL >>>> Need consent we are fine to use this value, and with this value it >>>> shall be placed just before the DEV_STATS_OBJECTID item which is more >>>> frequently used only during the device errors. >>>> >>>> . I am using a u64 bitmap to represent the devices id, so the max device >>>> id that we could represent is 63, its a kind of limitation which should >>>> be addressed before integration, I wonder if there is any suggestion? >>>> Kindly note that, multiple ioctls with each time representing a set of >>>> device(s) is not a choice because we need to make sure the readmirror >>>> changes happens in a commit transaction. >>>> >>>> v1->RFC v2: >>>> . Property is stored as a dev-tree item instead of root inode extended >>>> attribute. >>>> . Rename BTRFS_DEV_STATE_READ_OPRIMIZED to BTRFS_DEV_STATE_READ_PREFERRED. >>>> . Changed format specifier from devid1,2,3.. to devid:1,2,3.. >>>> >>>> RFC->v1: >>>> Drops pid as one of the readmirror policy choices and as usual remains >>>> as default. And when the devid is reset the readmirror policy falls back >>>> to pid. >>>> Drops the mount -o readmirror idea, it can be added at a later point of >>>> time. >>>> Property now accepts more than 1 devid as readmirror device. As shown >>>> in the example above. >>>> >>> >>> This is a lot of infrastructure Ok. Any idea on a better implementation? How about extended attribute approach? v1 patches proposed it, but it abused the extended attribute as commented here [1] and v2 got changed to an item-key. [1] https://lore.kernel.org/linux-btrfs/be68e6ea-00bc-b750-25e1-9c584b99308f@gmx.com/ >>> to just change which mirror we read to based on >>> some arbitrary user policy. I assume this is to solve the case where you have >>> slow and fast disks, so you can always read from the fast disk? And then it's >>> only used in RAID1, so the very narrow usecase of having a RAID1 setup with a >>> SSD and a normal disk? I'm not seeing a point to this much code for one >>> particular obscure setup. Thanks, >>> >>> Josef >> >> Not commenting on the code itself, but as a user I see this SSD RAID1 >> acceleration as a future much have feature. It's only obscure at the >> moment because we don't have code to take advantage of it. But on >> large btrfs filesystems with hundreds of GB of metadata, like I have >> for backups, the usability of the filesystem is dramatically improved >> having the metadata on an SSD( though currently only half of the time >> due to the even/odd pid distribution.) > > But that's different from a mirror. 100% it would be nice to say "put my > metadata on the ssd, data elsewhere". That's not what this patch is about, this > patch is specifically about changing which drive we choose in a mirrored setup, > which is super unlikely to mirror a SSD with a slow drive, cause it's just going > to be slow no matter what. Sure we could make it so reads always go to the SSD, > but we can accomplish that by just adding a check for nonrotational in the code, > and then we don't have to encode all this nonsense in the file system. Thanks, I wrote about the readmirror policy framework here[2], I forgot to link it here, sorry about that, my mistake. [2] https://lore.kernel.org/linux-btrfs/1552989624-29577-1-git-send-email-anand.jain@oracle.com/ Readmirror policy is for raid1, raid10 and future N way mirror. Yes for now its only for raid1. Here the idea is to create a framework so that readmirror policy can be configured as needed. And nonrotational can be one such policy. The example of hard-coded nonrotational policy does not work in case of ssd and a remote iscsi ssd, OR in case of local ssd and a NVME block device, as all these are still nonrotational devices. So hard-coded policy is not a good idea. If we have to hardcode then there is Q-depth based readmirror routing is better (patch in the ML), but that is not good enough, because some configs wants it based on the disk-LBA so that SAN storage target cache is balanced and not duplicated. So in short it must be a configurable policy. devid policy is the first policy which is for the advance users when they know what they are doing, which is sure to support any types of HW configurations/combinations (except for the cache balance). So in total potential configurable policies are: - pid - original. dropped as a policy because of the comments received [3]. - Q-depth - patches are in the ML this can be the default policy. - LBA - to avoid duplicating the cache on the storage target in SAN - devid - as discussed above. - nonrotational - as discussed above. So there are 5 ways to configure as needed, so a framework infrastructure is worth? [3] https://lore.kernel.org/linux-btrfs/20190409154840.GM29086@twin.jikos.cz/ Thanks, Anand > Josef >
On 12/9/19 12:37 AM, David Sterba wrote: > On Mon, Aug 26, 2019 at 05:04:36PM +0800, Anand Jain wrote: >> Function call chain __btrfs_map_block()->find_live_mirror() uses >> thread pid to determine the %mirror_num when the mirror_num=0. >> >> This patch introduces a framework so that we can add policies to determine >> the %mirror_num. And also adds the devid as the readmirror policy. >> >> The new property is stored as an item in the device tree as show below. >> (BTRFS_READMIRROR_OBJECTID, BTRFS_PERSISTENT_ITEM_KEY, devid) >> >> To be able to set and get this new property also introduces new ioctls >> BTRFS_IOC_GET_READMIRROR and BTRFS_IOC_SET_READMIRROR. The ioctl argument >> is defined as >> struct btrfs_ioctl_readmirror_args { >> __u64 type; /* RW */ >> __u64 device_bitmap; /* RW */ >> } > > I don't remember if there was a suggestion to use ioctls for read > mirror, but the property interface should be sufficient. Besides this > ioctl interafce is quite an anti-pattern: narrow use, non-extensible > structure, alternative and more convenient interface already available. Extended attribute interface f(get/set)attr is inode bound, but the readmirror property is filesystem bound. For the readmirror we can still use the extended attribute, but it might be considered as abuse which we haven't done so far, here below [1] is the list of property with the interface it uses and where the property is saved. [1] property interface save-as ro ioctl root-item label ioctl super-block compression xattr xattr v1:readmirror xattr xattr v2:readmirror ioctl dev-tree-item You are asking for the combination of property interface save-as readmirror xattr dev-tree-item I can give a try. >> An usage example as follows: >> btrfs property set /btrfs readmirror devid:1,3 >> btrfs property get /btrfs readmirror >> readmirror devid:1 3 >> btrfs property set /btrfs readmirror "" >> btrfs property get /btrfs readmirror >> readmirror default >> >> This patchset has been tested completely, however marked as RFC for the >> following reasons and comments on them (or any other) are appreciated as >> usual. >> . The new objectid is defined as >> #define BTRFS_READMIRROR_OBJECTID -1ULL >> Need consent we are fine to use this value, and with this value it >> shall be placed just before the DEV_STATS_OBJECTID item which is more >> frequently used only during the device errors. > > -1 can be considered a special value in other cases, not necessarily > here but if the ordering of items is the only reason I'd say no. The > keys/items will most likely land in the same node so there's no point > forcing the order. Agreed. Any suggestion on the value for the BTRFS_READMIRROR_OBJECTID. >> . I am using a u64 bitmap to represent the devices id, so the max device >> id that we could represent is 63, its a kind of limitation which should >> be addressed before integration, I wonder if there is any suggestion? > > Uh 63, so now an implementation detail is posing a global limit? That > sounds backwards. Yes. And I was thinking of u64 array but that doesn't scale as well. Anyways I have in the list to try using xattr interface which may address this issue. >> Kindly note that, multiple ioctls with each time representing a set of >> device(s) is not a choice because we need to make sure the readmirror >> changes happens in a commit transaction. > > I believe this can be guaranteed by the properties interface, ie. single > value gets processed at once and with some locking around the state of > devices can be updated atomically. > > The open question is still how to store the readmirror property > per-device, it could be either an item or bit inside the device > structure. We discussed that before here. https://lore.kernel.org/linux-btrfs/8d31c3a2-3fb0-63af-3173-948ed0e84de3@oracle.com/ > Besides the interface, I'm kind of missing the usecase description what > is expected from the read mirror policies and how they could affect > various scenarios. Maybe it was in some of the previous iterations, it's > hard too track everything so this should be part of the cover letter (or > at leat linked if it's too much text). > Yep. My mistake I missed it link it. Sorry about that. Thanks, Anand
On Thu, Sep 12, 2019 at 03:41:42PM +0800, Anand Jain wrote: > > > Thanks for the comments. More below. > > On 12/9/19 3:16 AM, Josef Bacik wrote: > > On Wed, Sep 11, 2019 at 03:13:21PM -0400, Eli V wrote: > > > On Wed, Sep 11, 2019 at 2:46 PM Josef Bacik <josef@toxicpanda.com> wrote: > > > > > > > > On Mon, Aug 26, 2019 at 05:04:36PM +0800, Anand Jain wrote: > > > > > Function call chain __btrfs_map_block()->find_live_mirror() uses > > > > > thread pid to determine the %mirror_num when the mirror_num=0. > > > > > > > > > > This patch introduces a framework so that we can add policies to determine > > > > > the %mirror_num. And also adds the devid as the readmirror policy. > > > > > > > > > > The new property is stored as an item in the device tree as show below. > > > > > (BTRFS_READMIRROR_OBJECTID, BTRFS_PERSISTENT_ITEM_KEY, devid) > > > > > > > > > > To be able to set and get this new property also introduces new ioctls > > > > > BTRFS_IOC_GET_READMIRROR and BTRFS_IOC_SET_READMIRROR. The ioctl argument > > > > > is defined as > > > > > struct btrfs_ioctl_readmirror_args { > > > > > __u64 type; /* RW */ > > > > > __u64 device_bitmap; /* RW */ > > > > > } > > > > > > > > > > An usage example as follows: > > > > > btrfs property set /btrfs readmirror devid:1,3 > > > > > btrfs property get /btrfs readmirror > > > > > readmirror devid:1 3 > > > > > btrfs property set /btrfs readmirror "" > > > > > btrfs property get /btrfs readmirror > > > > > readmirror default > > > > > > > > > > This patchset has been tested completely, however marked as RFC for the > > > > > following reasons and comments on them (or any other) are appreciated as > > > > > usual. > > > > > . The new objectid is defined as > > > > > #define BTRFS_READMIRROR_OBJECTID -1ULL > > > > > Need consent we are fine to use this value, and with this value it > > > > > shall be placed just before the DEV_STATS_OBJECTID item which is more > > > > > frequently used only during the device errors. > > > > > > > > > > . I am using a u64 bitmap to represent the devices id, so the max device > > > > > id that we could represent is 63, its a kind of limitation which should > > > > > be addressed before integration, I wonder if there is any suggestion? > > > > > Kindly note that, multiple ioctls with each time representing a set of > > > > > device(s) is not a choice because we need to make sure the readmirror > > > > > changes happens in a commit transaction. > > > > > > > > > > v1->RFC v2: > > > > > . Property is stored as a dev-tree item instead of root inode extended > > > > > attribute. > > > > > . Rename BTRFS_DEV_STATE_READ_OPRIMIZED to BTRFS_DEV_STATE_READ_PREFERRED. > > > > > . Changed format specifier from devid1,2,3.. to devid:1,2,3.. > > > > > > > > > > RFC->v1: > > > > > Drops pid as one of the readmirror policy choices and as usual remains > > > > > as default. And when the devid is reset the readmirror policy falls back > > > > > to pid. > > > > > Drops the mount -o readmirror idea, it can be added at a later point of > > > > > time. > > > > > Property now accepts more than 1 devid as readmirror device. As shown > > > > > in the example above. > > > > > > > > > > > > > This is a lot of infrastructure > > Ok. Any idea on a better implementation? > How about extended attribute approach? v1 patches proposed > it, but it abused the extended attribute as commented here [1] > and v2 got changed to an item-key. > > [1] > https://lore.kernel.org/linux-btrfs/be68e6ea-00bc-b750-25e1-9c584b99308f@gmx.com/ > That's a NAK on the prop interface. This is a fs wide policy, not a directory/inode policy. > > > > > to just change which mirror we read to based on > > > > some arbitrary user policy. I assume this is to solve the case where you have > > > > slow and fast disks, so you can always read from the fast disk? And then it's > > > > only used in RAID1, so the very narrow usecase of having a RAID1 setup with a > > > > SSD and a normal disk? I'm not seeing a point to this much code for one > > > > particular obscure setup. Thanks, > > > > > > > > Josef > > > > > > Not commenting on the code itself, but as a user I see this SSD RAID1 > > > acceleration as a future much have feature. It's only obscure at the > > > moment because we don't have code to take advantage of it. But on > > > large btrfs filesystems with hundreds of GB of metadata, like I have > > > for backups, the usability of the filesystem is dramatically improved > > > having the metadata on an SSD( though currently only half of the time > > > due to the even/odd pid distribution.) > > > > But that's different from a mirror. 100% it would be nice to say "put my > > metadata on the ssd, data elsewhere". That's not what this patch is about, this > > patch is specifically about changing which drive we choose in a mirrored setup, > > which is super unlikely to mirror a SSD with a slow drive, cause it's just going > > to be slow no matter what. Sure we could make it so reads always go to the SSD, > > but we can accomplish that by just adding a check for nonrotational in the code, > > and then we don't have to encode all this nonsense in the file system. Thanks, > > I wrote about the readmirror policy framework here[2], > I forgot to link it here, sorry about that, my mistake. > > [2] > > https://lore.kernel.org/linux-btrfs/1552989624-29577-1-git-send-email-anand.jain@oracle.com/ > > Readmirror policy is for raid1, raid10 and future N way mirror. > Yes for now its only for raid1. > > Here the idea is to create a framework so that readmirror policy > can be configured as needed. And nonrotational can be one such policy. > > The example of hard-coded nonrotational policy does not work in case > of ssd and a remote iscsi ssd, OR in case of local ssd and a NVME block > device, as all these are still nonrotational devices. So hard-coded > policy is not a good idea. If we have to hardcode then there is Q-depth > based readmirror routing is better (patch in the ML), but that is > not good enough, because some configs wants it based on the disk-LBA > so that SAN storage target cache is balanced and not duplicated. > So in short it must be a configurable policy. > Again, if you are mixing disk types you likely always want non-rotational, but still mixing different speed devices in a mirror setup is just asking for weird latency problems. I don't think solving this use case is necessary. If you mix ssd + network device in a serious production setup then you probably should be fired cause you don't know what you are doing. Having the generic "nonrotational gets priority" is going to cover 99% of the actual use cases that make sense. The SAN usecase I can sort of see, but again I don't feel like it's a problem we need to solve with on-disk format. Add a priority to sysfs so you can change it with udev or something on the fly. Thanks, Josef
> On 12 Sep 2019, at 5:50 PM, Josef Bacik <josef@toxicpanda.com> wrote: > > On Thu, Sep 12, 2019 at 03:41:42PM +0800, Anand Jain wrote: >> >> >> Thanks for the comments. More below. >> >> On 12/9/19 3:16 AM, Josef Bacik wrote: >>> On Wed, Sep 11, 2019 at 03:13:21PM -0400, Eli V wrote: >>>> On Wed, Sep 11, 2019 at 2:46 PM Josef Bacik <josef@toxicpanda.com> wrote: >>>>> >>>>> On Mon, Aug 26, 2019 at 05:04:36PM +0800, Anand Jain wrote: >>>>>> Function call chain __btrfs_map_block()->find_live_mirror() uses >>>>>> thread pid to determine the %mirror_num when the mirror_num=0. >>>>>> >>>>>> This patch introduces a framework so that we can add policies to determine >>>>>> the %mirror_num. And also adds the devid as the readmirror policy. >>>>>> >>>>>> The new property is stored as an item in the device tree as show below. >>>>>> (BTRFS_READMIRROR_OBJECTID, BTRFS_PERSISTENT_ITEM_KEY, devid) >>>>>> >>>>>> To be able to set and get this new property also introduces new ioctls >>>>>> BTRFS_IOC_GET_READMIRROR and BTRFS_IOC_SET_READMIRROR. The ioctl argument >>>>>> is defined as >>>>>> struct btrfs_ioctl_readmirror_args { >>>>>> __u64 type; /* RW */ >>>>>> __u64 device_bitmap; /* RW */ >>>>>> } >>>>>> >>>>>> An usage example as follows: >>>>>> btrfs property set /btrfs readmirror devid:1,3 >>>>>> btrfs property get /btrfs readmirror >>>>>> readmirror devid:1 3 >>>>>> btrfs property set /btrfs readmirror "" >>>>>> btrfs property get /btrfs readmirror >>>>>> readmirror default >>>>>> >>>>>> This patchset has been tested completely, however marked as RFC for the >>>>>> following reasons and comments on them (or any other) are appreciated as >>>>>> usual. >>>>>> . The new objectid is defined as >>>>>> #define BTRFS_READMIRROR_OBJECTID -1ULL >>>>>> Need consent we are fine to use this value, and with this value it >>>>>> shall be placed just before the DEV_STATS_OBJECTID item which is more >>>>>> frequently used only during the device errors. >>>>>> >>>>>> . I am using a u64 bitmap to represent the devices id, so the max device >>>>>> id that we could represent is 63, its a kind of limitation which should >>>>>> be addressed before integration, I wonder if there is any suggestion? >>>>>> Kindly note that, multiple ioctls with each time representing a set of >>>>>> device(s) is not a choice because we need to make sure the readmirror >>>>>> changes happens in a commit transaction. >>>>>> >>>>>> v1->RFC v2: >>>>>> . Property is stored as a dev-tree item instead of root inode extended >>>>>> attribute. >>>>>> . Rename BTRFS_DEV_STATE_READ_OPRIMIZED to BTRFS_DEV_STATE_READ_PREFERRED. >>>>>> . Changed format specifier from devid1,2,3.. to devid:1,2,3.. >>>>>> >>>>>> RFC->v1: >>>>>> Drops pid as one of the readmirror policy choices and as usual remains >>>>>> as default. And when the devid is reset the readmirror policy falls back >>>>>> to pid. >>>>>> Drops the mount -o readmirror idea, it can be added at a later point of >>>>>> time. >>>>>> Property now accepts more than 1 devid as readmirror device. As shown >>>>>> in the example above. >>>>>> >>>>> >>>>> This is a lot of infrastructure >> >> Ok. Any idea on a better implementation? >> How about extended attribute approach? v1 patches proposed >> it, but it abused the extended attribute as commented here [1] >> and v2 got changed to an item-key. >> >> [1] >> https://lore.kernel.org/linux-btrfs/be68e6ea-00bc-b750-25e1-9c584b99308f@gmx.com/ >> > > That's a NAK on the prop interface. This is a fs wide policy, not a > directory/inode policy. > >> >>>>> to just change which mirror we read to based on >>>>> some arbitrary user policy. I assume this is to solve the case where you have >>>>> slow and fast disks, so you can always read from the fast disk? And then it's >>>>> only used in RAID1, so the very narrow usecase of having a RAID1 setup with a >>>>> SSD and a normal disk? I'm not seeing a point to this much code for one >>>>> particular obscure setup. Thanks, >>>>> >>>>> Josef >>>> >>>> Not commenting on the code itself, but as a user I see this SSD RAID1 >>>> acceleration as a future much have feature. It's only obscure at the >>>> moment because we don't have code to take advantage of it. But on >>>> large btrfs filesystems with hundreds of GB of metadata, like I have >>>> for backups, the usability of the filesystem is dramatically improved >>>> having the metadata on an SSD( though currently only half of the time >>>> due to the even/odd pid distribution.) >>> >>> But that's different from a mirror. 100% it would be nice to say "put my >>> metadata on the ssd, data elsewhere". That's not what this patch is about, this >>> patch is specifically about changing which drive we choose in a mirrored setup, >>> which is super unlikely to mirror a SSD with a slow drive, cause it's just going >>> to be slow no matter what. Sure we could make it so reads always go to the SSD, >>> but we can accomplish that by just adding a check for nonrotational in the code, >>> and then we don't have to encode all this nonsense in the file system. Thanks, >> >> I wrote about the readmirror policy framework here[2], >> I forgot to link it here, sorry about that, my mistake. >> >> [2] >> >> https://lore.kernel.org/linux-btrfs/1552989624-29577-1-git-send-email-anand.jain@oracle.com/ >> >> Readmirror policy is for raid1, raid10 and future N way mirror. >> Yes for now its only for raid1. >> >> Here the idea is to create a framework so that readmirror policy >> can be configured as needed. And nonrotational can be one such policy. >> >> The example of hard-coded nonrotational policy does not work in case >> of ssd and a remote iscsi ssd, OR in case of local ssd and a NVME block >> device, as all these are still nonrotational devices. So hard-coded >> policy is not a good idea. If we have to hardcode then there is Q-depth >> based readmirror routing is better (patch in the ML), but that is >> not good enough, because some configs wants it based on the disk-LBA >> so that SAN storage target cache is balanced and not duplicated. >> So in short it must be a configurable policy. >> > > Again, if you are mixing disk types you likely always want non-rotational, but > still mixing different speed devices in a mirror setup is just asking for weird > latency problems. I don't think solving this use case is necessary. If you mix > ssd + network device in a serious production setup then you probably should be > fired cause you don't know what you are doing. Having the generic > "nonrotational gets priority" is going to cover 99% of the actual use cases that > make sense. > > The SAN usecase I can sort of see, but again I don't feel like it's a problem we > need to solve with on-disk format. Add a priority to sysfs so you can change it > with udev or something on the fly. Thanks, > Ok. Sysfs is fine however we need configuration to be persistent across reboots. Any idea? Thanks, Anand > Josef
On Thu, Sep 12, 2019 at 06:00:21PM +0800, Anand Jain wrote: > > > > On 12 Sep 2019, at 5:50 PM, Josef Bacik <josef@toxicpanda.com> wrote: > > > > On Thu, Sep 12, 2019 at 03:41:42PM +0800, Anand Jain wrote: > >> > >> > >> Thanks for the comments. More below. > >> > >> On 12/9/19 3:16 AM, Josef Bacik wrote: > >>> On Wed, Sep 11, 2019 at 03:13:21PM -0400, Eli V wrote: > >>>> On Wed, Sep 11, 2019 at 2:46 PM Josef Bacik <josef@toxicpanda.com> wrote: > >>>>> > >>>>> On Mon, Aug 26, 2019 at 05:04:36PM +0800, Anand Jain wrote: > >>>>>> Function call chain __btrfs_map_block()->find_live_mirror() uses > >>>>>> thread pid to determine the %mirror_num when the mirror_num=0. > >>>>>> > >>>>>> This patch introduces a framework so that we can add policies to determine > >>>>>> the %mirror_num. And also adds the devid as the readmirror policy. > >>>>>> > >>>>>> The new property is stored as an item in the device tree as show below. > >>>>>> (BTRFS_READMIRROR_OBJECTID, BTRFS_PERSISTENT_ITEM_KEY, devid) > >>>>>> > >>>>>> To be able to set and get this new property also introduces new ioctls > >>>>>> BTRFS_IOC_GET_READMIRROR and BTRFS_IOC_SET_READMIRROR. The ioctl argument > >>>>>> is defined as > >>>>>> struct btrfs_ioctl_readmirror_args { > >>>>>> __u64 type; /* RW */ > >>>>>> __u64 device_bitmap; /* RW */ > >>>>>> } > >>>>>> > >>>>>> An usage example as follows: > >>>>>> btrfs property set /btrfs readmirror devid:1,3 > >>>>>> btrfs property get /btrfs readmirror > >>>>>> readmirror devid:1 3 > >>>>>> btrfs property set /btrfs readmirror "" > >>>>>> btrfs property get /btrfs readmirror > >>>>>> readmirror default > >>>>>> > >>>>>> This patchset has been tested completely, however marked as RFC for the > >>>>>> following reasons and comments on them (or any other) are appreciated as > >>>>>> usual. > >>>>>> . The new objectid is defined as > >>>>>> #define BTRFS_READMIRROR_OBJECTID -1ULL > >>>>>> Need consent we are fine to use this value, and with this value it > >>>>>> shall be placed just before the DEV_STATS_OBJECTID item which is more > >>>>>> frequently used only during the device errors. > >>>>>> > >>>>>> . I am using a u64 bitmap to represent the devices id, so the max device > >>>>>> id that we could represent is 63, its a kind of limitation which should > >>>>>> be addressed before integration, I wonder if there is any suggestion? > >>>>>> Kindly note that, multiple ioctls with each time representing a set of > >>>>>> device(s) is not a choice because we need to make sure the readmirror > >>>>>> changes happens in a commit transaction. > >>>>>> > >>>>>> v1->RFC v2: > >>>>>> . Property is stored as a dev-tree item instead of root inode extended > >>>>>> attribute. > >>>>>> . Rename BTRFS_DEV_STATE_READ_OPRIMIZED to BTRFS_DEV_STATE_READ_PREFERRED. > >>>>>> . Changed format specifier from devid1,2,3.. to devid:1,2,3.. > >>>>>> > >>>>>> RFC->v1: > >>>>>> Drops pid as one of the readmirror policy choices and as usual remains > >>>>>> as default. And when the devid is reset the readmirror policy falls back > >>>>>> to pid. > >>>>>> Drops the mount -o readmirror idea, it can be added at a later point of > >>>>>> time. > >>>>>> Property now accepts more than 1 devid as readmirror device. As shown > >>>>>> in the example above. > >>>>>> > >>>>> > >>>>> This is a lot of infrastructure > >> > >> Ok. Any idea on a better implementation? > >> How about extended attribute approach? v1 patches proposed > >> it, but it abused the extended attribute as commented here [1] > >> and v2 got changed to an item-key. > >> > >> [1] > >> https://lore.kernel.org/linux-btrfs/be68e6ea-00bc-b750-25e1-9c584b99308f@gmx.com/ > >> > > > > That's a NAK on the prop interface. This is a fs wide policy, not a > > directory/inode policy. > > > >> > >>>>> to just change which mirror we read to based on > >>>>> some arbitrary user policy. I assume this is to solve the case where you have > >>>>> slow and fast disks, so you can always read from the fast disk? And then it's > >>>>> only used in RAID1, so the very narrow usecase of having a RAID1 setup with a > >>>>> SSD and a normal disk? I'm not seeing a point to this much code for one > >>>>> particular obscure setup. Thanks, > >>>>> > >>>>> Josef > >>>> > >>>> Not commenting on the code itself, but as a user I see this SSD RAID1 > >>>> acceleration as a future much have feature. It's only obscure at the > >>>> moment because we don't have code to take advantage of it. But on > >>>> large btrfs filesystems with hundreds of GB of metadata, like I have > >>>> for backups, the usability of the filesystem is dramatically improved > >>>> having the metadata on an SSD( though currently only half of the time > >>>> due to the even/odd pid distribution.) > >>> > >>> But that's different from a mirror. 100% it would be nice to say "put my > >>> metadata on the ssd, data elsewhere". That's not what this patch is about, this > >>> patch is specifically about changing which drive we choose in a mirrored setup, > >>> which is super unlikely to mirror a SSD with a slow drive, cause it's just going > >>> to be slow no matter what. Sure we could make it so reads always go to the SSD, > >>> but we can accomplish that by just adding a check for nonrotational in the code, > >>> and then we don't have to encode all this nonsense in the file system. Thanks, > >> > >> I wrote about the readmirror policy framework here[2], > >> I forgot to link it here, sorry about that, my mistake. > >> > >> [2] > >> > >> https://lore.kernel.org/linux-btrfs/1552989624-29577-1-git-send-email-anand.jain@oracle.com/ > >> > >> Readmirror policy is for raid1, raid10 and future N way mirror. > >> Yes for now its only for raid1. > >> > >> Here the idea is to create a framework so that readmirror policy > >> can be configured as needed. And nonrotational can be one such policy. > >> > >> The example of hard-coded nonrotational policy does not work in case > >> of ssd and a remote iscsi ssd, OR in case of local ssd and a NVME block > >> device, as all these are still nonrotational devices. So hard-coded > >> policy is not a good idea. If we have to hardcode then there is Q-depth > >> based readmirror routing is better (patch in the ML), but that is > >> not good enough, because some configs wants it based on the disk-LBA > >> so that SAN storage target cache is balanced and not duplicated. > >> So in short it must be a configurable policy. > >> > > > > Again, if you are mixing disk types you likely always want non-rotational, but > > still mixing different speed devices in a mirror setup is just asking for weird > > latency problems. I don't think solving this use case is necessary. If you mix > > ssd + network device in a serious production setup then you probably should be > > fired cause you don't know what you are doing. Having the generic > > "nonrotational gets priority" is going to cover 99% of the actual use cases that > > make sense. > > > > The SAN usecase I can sort of see, but again I don't feel like it's a problem we > > need to solve with on-disk format. Add a priority to sysfs so you can change it > > with udev or something on the fly. Thanks, > > > > Ok. > Sysfs is fine however we need configuration to be persistent across reboots. > Any idea? > Udev rules. Thanks, Josef
> On 12 Sep 2019, at 6:03 PM, Josef Bacik <josef@toxicpanda.com> wrote: > > On Thu, Sep 12, 2019 at 06:00:21PM +0800, Anand Jain wrote: >> >> >>> On 12 Sep 2019, at 5:50 PM, Josef Bacik <josef@toxicpanda.com> wrote: >>> >>> On Thu, Sep 12, 2019 at 03:41:42PM +0800, Anand Jain wrote: >>>> >>>> >>>> Thanks for the comments. More below. >>>> >>>> On 12/9/19 3:16 AM, Josef Bacik wrote: >>>>> On Wed, Sep 11, 2019 at 03:13:21PM -0400, Eli V wrote: >>>>>> On Wed, Sep 11, 2019 at 2:46 PM Josef Bacik <josef@toxicpanda.com> wrote: >>>>>>> >>>>>>> On Mon, Aug 26, 2019 at 05:04:36PM +0800, Anand Jain wrote: >>>>>>>> Function call chain __btrfs_map_block()->find_live_mirror() uses >>>>>>>> thread pid to determine the %mirror_num when the mirror_num=0. >>>>>>>> >>>>>>>> This patch introduces a framework so that we can add policies to determine >>>>>>>> the %mirror_num. And also adds the devid as the readmirror policy. >>>>>>>> >>>>>>>> The new property is stored as an item in the device tree as show below. >>>>>>>> (BTRFS_READMIRROR_OBJECTID, BTRFS_PERSISTENT_ITEM_KEY, devid) >>>>>>>> >>>>>>>> To be able to set and get this new property also introduces new ioctls >>>>>>>> BTRFS_IOC_GET_READMIRROR and BTRFS_IOC_SET_READMIRROR. The ioctl argument >>>>>>>> is defined as >>>>>>>> struct btrfs_ioctl_readmirror_args { >>>>>>>> __u64 type; /* RW */ >>>>>>>> __u64 device_bitmap; /* RW */ >>>>>>>> } >>>>>>>> >>>>>>>> An usage example as follows: >>>>>>>> btrfs property set /btrfs readmirror devid:1,3 >>>>>>>> btrfs property get /btrfs readmirror >>>>>>>> readmirror devid:1 3 >>>>>>>> btrfs property set /btrfs readmirror "" >>>>>>>> btrfs property get /btrfs readmirror >>>>>>>> readmirror default >>>>>>>> >>>>>>>> This patchset has been tested completely, however marked as RFC for the >>>>>>>> following reasons and comments on them (or any other) are appreciated as >>>>>>>> usual. >>>>>>>> . The new objectid is defined as >>>>>>>> #define BTRFS_READMIRROR_OBJECTID -1ULL >>>>>>>> Need consent we are fine to use this value, and with this value it >>>>>>>> shall be placed just before the DEV_STATS_OBJECTID item which is more >>>>>>>> frequently used only during the device errors. >>>>>>>> >>>>>>>> . I am using a u64 bitmap to represent the devices id, so the max device >>>>>>>> id that we could represent is 63, its a kind of limitation which should >>>>>>>> be addressed before integration, I wonder if there is any suggestion? >>>>>>>> Kindly note that, multiple ioctls with each time representing a set of >>>>>>>> device(s) is not a choice because we need to make sure the readmirror >>>>>>>> changes happens in a commit transaction. >>>>>>>> >>>>>>>> v1->RFC v2: >>>>>>>> . Property is stored as a dev-tree item instead of root inode extended >>>>>>>> attribute. >>>>>>>> . Rename BTRFS_DEV_STATE_READ_OPRIMIZED to BTRFS_DEV_STATE_READ_PREFERRED. >>>>>>>> . Changed format specifier from devid1,2,3.. to devid:1,2,3.. >>>>>>>> >>>>>>>> RFC->v1: >>>>>>>> Drops pid as one of the readmirror policy choices and as usual remains >>>>>>>> as default. And when the devid is reset the readmirror policy falls back >>>>>>>> to pid. >>>>>>>> Drops the mount -o readmirror idea, it can be added at a later point of >>>>>>>> time. >>>>>>>> Property now accepts more than 1 devid as readmirror device. As shown >>>>>>>> in the example above. >>>>>>>> >>>>>>> >>>>>>> This is a lot of infrastructure >>>> >>>> Ok. Any idea on a better implementation? >>>> How about extended attribute approach? v1 patches proposed >>>> it, but it abused the extended attribute as commented here [1] >>>> and v2 got changed to an item-key. >>>> >>>> [1] >>>> https://lore.kernel.org/linux-btrfs/be68e6ea-00bc-b750-25e1-9c584b99308f@gmx.com/ >>>> >>> >>> That's a NAK on the prop interface. This is a fs wide policy, not a >>> directory/inode policy. >>> >>>> >>>>>>> to just change which mirror we read to based on >>>>>>> some arbitrary user policy. I assume this is to solve the case where you have >>>>>>> slow and fast disks, so you can always read from the fast disk? And then it's >>>>>>> only used in RAID1, so the very narrow usecase of having a RAID1 setup with a >>>>>>> SSD and a normal disk? I'm not seeing a point to this much code for one >>>>>>> particular obscure setup. Thanks, >>>>>>> >>>>>>> Josef >>>>>> >>>>>> Not commenting on the code itself, but as a user I see this SSD RAID1 >>>>>> acceleration as a future much have feature. It's only obscure at the >>>>>> moment because we don't have code to take advantage of it. But on >>>>>> large btrfs filesystems with hundreds of GB of metadata, like I have >>>>>> for backups, the usability of the filesystem is dramatically improved >>>>>> having the metadata on an SSD( though currently only half of the time >>>>>> due to the even/odd pid distribution.) >>>>> >>>>> But that's different from a mirror. 100% it would be nice to say "put my >>>>> metadata on the ssd, data elsewhere". That's not what this patch is about, this >>>>> patch is specifically about changing which drive we choose in a mirrored setup, >>>>> which is super unlikely to mirror a SSD with a slow drive, cause it's just going >>>>> to be slow no matter what. Sure we could make it so reads always go to the SSD, >>>>> but we can accomplish that by just adding a check for nonrotational in the code, >>>>> and then we don't have to encode all this nonsense in the file system. Thanks, >>>> >>>> I wrote about the readmirror policy framework here[2], >>>> I forgot to link it here, sorry about that, my mistake. >>>> >>>> [2] >>>> >>>> https://lore.kernel.org/linux-btrfs/1552989624-29577-1-git-send-email-anand.jain@oracle.com/ >>>> >>>> Readmirror policy is for raid1, raid10 and future N way mirror. >>>> Yes for now its only for raid1. >>>> >>>> Here the idea is to create a framework so that readmirror policy >>>> can be configured as needed. And nonrotational can be one such policy. >>>> >>>> The example of hard-coded nonrotational policy does not work in case >>>> of ssd and a remote iscsi ssd, OR in case of local ssd and a NVME block >>>> device, as all these are still nonrotational devices. So hard-coded >>>> policy is not a good idea. If we have to hardcode then there is Q-depth >>>> based readmirror routing is better (patch in the ML), but that is >>>> not good enough, because some configs wants it based on the disk-LBA >>>> so that SAN storage target cache is balanced and not duplicated. >>>> So in short it must be a configurable policy. >>>> >>> >>> Again, if you are mixing disk types you likely always want non-rotational, but >>> still mixing different speed devices in a mirror setup is just asking for weird >>> latency problems. I don't think solving this use case is necessary. If you mix >>> ssd + network device in a serious production setup then you probably should be >>> fired cause you don't know what you are doing. Having the generic >>> "nonrotational gets priority" is going to cover 99% of the actual use cases that >>> make sense. >>> >>> The SAN usecase I can sort of see, but again I don't feel like it's a problem we >>> need to solve with on-disk format. Add a priority to sysfs so you can change it >>> with udev or something on the fly. Thanks, >>> >> >> Ok. >> Sysfs is fine however we need configuration to be persistent across reboots. >> Any idea? >> > > Udev rules. Thanks, > Josef, configs moving along with the luns in san seems to be more easy to manage, otherwise when the host fails, each potential new server has to be pre-configured with the udev rules. Thanks, Anand > Josef
On Thu, Sep 12, 2019 at 06:10:08PM +0800, Anand Jain wrote: > > > > On 12 Sep 2019, at 6:03 PM, Josef Bacik <josef@toxicpanda.com> wrote: > > > > On Thu, Sep 12, 2019 at 06:00:21PM +0800, Anand Jain wrote: > >> > >> > >>> On 12 Sep 2019, at 5:50 PM, Josef Bacik <josef@toxicpanda.com> wrote: > >>> > >>> On Thu, Sep 12, 2019 at 03:41:42PM +0800, Anand Jain wrote: > >>>> > >>>> > >>>> Thanks for the comments. More below. > >>>> > >>>> On 12/9/19 3:16 AM, Josef Bacik wrote: > >>>>> On Wed, Sep 11, 2019 at 03:13:21PM -0400, Eli V wrote: > >>>>>> On Wed, Sep 11, 2019 at 2:46 PM Josef Bacik <josef@toxicpanda.com> wrote: > >>>>>>> > >>>>>>> On Mon, Aug 26, 2019 at 05:04:36PM +0800, Anand Jain wrote: > >>>>>>>> Function call chain __btrfs_map_block()->find_live_mirror() uses > >>>>>>>> thread pid to determine the %mirror_num when the mirror_num=0. > >>>>>>>> > >>>>>>>> This patch introduces a framework so that we can add policies to determine > >>>>>>>> the %mirror_num. And also adds the devid as the readmirror policy. > >>>>>>>> > >>>>>>>> The new property is stored as an item in the device tree as show below. > >>>>>>>> (BTRFS_READMIRROR_OBJECTID, BTRFS_PERSISTENT_ITEM_KEY, devid) > >>>>>>>> > >>>>>>>> To be able to set and get this new property also introduces new ioctls > >>>>>>>> BTRFS_IOC_GET_READMIRROR and BTRFS_IOC_SET_READMIRROR. The ioctl argument > >>>>>>>> is defined as > >>>>>>>> struct btrfs_ioctl_readmirror_args { > >>>>>>>> __u64 type; /* RW */ > >>>>>>>> __u64 device_bitmap; /* RW */ > >>>>>>>> } > >>>>>>>> > >>>>>>>> An usage example as follows: > >>>>>>>> btrfs property set /btrfs readmirror devid:1,3 > >>>>>>>> btrfs property get /btrfs readmirror > >>>>>>>> readmirror devid:1 3 > >>>>>>>> btrfs property set /btrfs readmirror "" > >>>>>>>> btrfs property get /btrfs readmirror > >>>>>>>> readmirror default > >>>>>>>> > >>>>>>>> This patchset has been tested completely, however marked as RFC for the > >>>>>>>> following reasons and comments on them (or any other) are appreciated as > >>>>>>>> usual. > >>>>>>>> . The new objectid is defined as > >>>>>>>> #define BTRFS_READMIRROR_OBJECTID -1ULL > >>>>>>>> Need consent we are fine to use this value, and with this value it > >>>>>>>> shall be placed just before the DEV_STATS_OBJECTID item which is more > >>>>>>>> frequently used only during the device errors. > >>>>>>>> > >>>>>>>> . I am using a u64 bitmap to represent the devices id, so the max device > >>>>>>>> id that we could represent is 63, its a kind of limitation which should > >>>>>>>> be addressed before integration, I wonder if there is any suggestion? > >>>>>>>> Kindly note that, multiple ioctls with each time representing a set of > >>>>>>>> device(s) is not a choice because we need to make sure the readmirror > >>>>>>>> changes happens in a commit transaction. > >>>>>>>> > >>>>>>>> v1->RFC v2: > >>>>>>>> . Property is stored as a dev-tree item instead of root inode extended > >>>>>>>> attribute. > >>>>>>>> . Rename BTRFS_DEV_STATE_READ_OPRIMIZED to BTRFS_DEV_STATE_READ_PREFERRED. > >>>>>>>> . Changed format specifier from devid1,2,3.. to devid:1,2,3.. > >>>>>>>> > >>>>>>>> RFC->v1: > >>>>>>>> Drops pid as one of the readmirror policy choices and as usual remains > >>>>>>>> as default. And when the devid is reset the readmirror policy falls back > >>>>>>>> to pid. > >>>>>>>> Drops the mount -o readmirror idea, it can be added at a later point of > >>>>>>>> time. > >>>>>>>> Property now accepts more than 1 devid as readmirror device. As shown > >>>>>>>> in the example above. > >>>>>>>> > >>>>>>> > >>>>>>> This is a lot of infrastructure > >>>> > >>>> Ok. Any idea on a better implementation? > >>>> How about extended attribute approach? v1 patches proposed > >>>> it, but it abused the extended attribute as commented here [1] > >>>> and v2 got changed to an item-key. > >>>> > >>>> [1] > >>>> https://lore.kernel.org/linux-btrfs/be68e6ea-00bc-b750-25e1-9c584b99308f@gmx.com/ > >>>> > >>> > >>> That's a NAK on the prop interface. This is a fs wide policy, not a > >>> directory/inode policy. > >>> > >>>> > >>>>>>> to just change which mirror we read to based on > >>>>>>> some arbitrary user policy. I assume this is to solve the case where you have > >>>>>>> slow and fast disks, so you can always read from the fast disk? And then it's > >>>>>>> only used in RAID1, so the very narrow usecase of having a RAID1 setup with a > >>>>>>> SSD and a normal disk? I'm not seeing a point to this much code for one > >>>>>>> particular obscure setup. Thanks, > >>>>>>> > >>>>>>> Josef > >>>>>> > >>>>>> Not commenting on the code itself, but as a user I see this SSD RAID1 > >>>>>> acceleration as a future much have feature. It's only obscure at the > >>>>>> moment because we don't have code to take advantage of it. But on > >>>>>> large btrfs filesystems with hundreds of GB of metadata, like I have > >>>>>> for backups, the usability of the filesystem is dramatically improved > >>>>>> having the metadata on an SSD( though currently only half of the time > >>>>>> due to the even/odd pid distribution.) > >>>>> > >>>>> But that's different from a mirror. 100% it would be nice to say "put my > >>>>> metadata on the ssd, data elsewhere". That's not what this patch is about, this > >>>>> patch is specifically about changing which drive we choose in a mirrored setup, > >>>>> which is super unlikely to mirror a SSD with a slow drive, cause it's just going > >>>>> to be slow no matter what. Sure we could make it so reads always go to the SSD, > >>>>> but we can accomplish that by just adding a check for nonrotational in the code, > >>>>> and then we don't have to encode all this nonsense in the file system. Thanks, > >>>> > >>>> I wrote about the readmirror policy framework here[2], > >>>> I forgot to link it here, sorry about that, my mistake. > >>>> > >>>> [2] > >>>> > >>>> https://lore.kernel.org/linux-btrfs/1552989624-29577-1-git-send-email-anand.jain@oracle.com/ > >>>> > >>>> Readmirror policy is for raid1, raid10 and future N way mirror. > >>>> Yes for now its only for raid1. > >>>> > >>>> Here the idea is to create a framework so that readmirror policy > >>>> can be configured as needed. And nonrotational can be one such policy. > >>>> > >>>> The example of hard-coded nonrotational policy does not work in case > >>>> of ssd and a remote iscsi ssd, OR in case of local ssd and a NVME block > >>>> device, as all these are still nonrotational devices. So hard-coded > >>>> policy is not a good idea. If we have to hardcode then there is Q-depth > >>>> based readmirror routing is better (patch in the ML), but that is > >>>> not good enough, because some configs wants it based on the disk-LBA > >>>> so that SAN storage target cache is balanced and not duplicated. > >>>> So in short it must be a configurable policy. > >>>> > >>> > >>> Again, if you are mixing disk types you likely always want non-rotational, but > >>> still mixing different speed devices in a mirror setup is just asking for weird > >>> latency problems. I don't think solving this use case is necessary. If you mix > >>> ssd + network device in a serious production setup then you probably should be > >>> fired cause you don't know what you are doing. Having the generic > >>> "nonrotational gets priority" is going to cover 99% of the actual use cases that > >>> make sense. > >>> > >>> The SAN usecase I can sort of see, but again I don't feel like it's a problem we > >>> need to solve with on-disk format. Add a priority to sysfs so you can change it > >>> with udev or something on the fly. Thanks, > >>> > >> > >> Ok. > >> Sysfs is fine however we need configuration to be persistent across reboots. > >> Any idea? > >> > > > > Udev rules. Thanks, > > > > Josef, configs moving along with the luns in san seems to be more > easy to manage, otherwise when the host fails, each potential new > server has to be pre-configured with the udev rules. > It's 2019, if people haven't figured out how to do persistent configuration by now then all hope is lost. Facebook persistently configures millions of machines, I'm sure people can figure out how to make sure a udev rule ends up on the right host connected to a SAN that doesn't move. Thanks, Josef
On 12/9/19 6:13 PM, Josef Bacik wrote: > On Thu, Sep 12, 2019 at 06:10:08PM +0800, Anand Jain wrote: >> >> >>> On 12 Sep 2019, at 6:03 PM, Josef Bacik <josef@toxicpanda.com> wrote: >>> >>> On Thu, Sep 12, 2019 at 06:00:21PM +0800, Anand Jain wrote: >>>> >>>> >>>>> On 12 Sep 2019, at 5:50 PM, Josef Bacik <josef@toxicpanda.com> wrote: >>>>> >>>>> On Thu, Sep 12, 2019 at 03:41:42PM +0800, Anand Jain wrote: >>>>>> >>>>>> >>>>>> Thanks for the comments. More below. >>>>>> >>>>>> On 12/9/19 3:16 AM, Josef Bacik wrote: >>>>>>> On Wed, Sep 11, 2019 at 03:13:21PM -0400, Eli V wrote: >>>>>>>> On Wed, Sep 11, 2019 at 2:46 PM Josef Bacik <josef@toxicpanda.com> wrote: >>>>>>>>> >>>>>>>>> On Mon, Aug 26, 2019 at 05:04:36PM +0800, Anand Jain wrote: >>>>>>>>>> Function call chain __btrfs_map_block()->find_live_mirror() uses >>>>>>>>>> thread pid to determine the %mirror_num when the mirror_num=0. >>>>>>>>>> >>>>>>>>>> This patch introduces a framework so that we can add policies to determine >>>>>>>>>> the %mirror_num. And also adds the devid as the readmirror policy. >>>>>>>>>> >>>>>>>>>> The new property is stored as an item in the device tree as show below. >>>>>>>>>> (BTRFS_READMIRROR_OBJECTID, BTRFS_PERSISTENT_ITEM_KEY, devid) >>>>>>>>>> >>>>>>>>>> To be able to set and get this new property also introduces new ioctls >>>>>>>>>> BTRFS_IOC_GET_READMIRROR and BTRFS_IOC_SET_READMIRROR. The ioctl argument >>>>>>>>>> is defined as >>>>>>>>>> struct btrfs_ioctl_readmirror_args { >>>>>>>>>> __u64 type; /* RW */ >>>>>>>>>> __u64 device_bitmap; /* RW */ >>>>>>>>>> } >>>>>>>>>> >>>>>>>>>> An usage example as follows: >>>>>>>>>> btrfs property set /btrfs readmirror devid:1,3 >>>>>>>>>> btrfs property get /btrfs readmirror >>>>>>>>>> readmirror devid:1 3 >>>>>>>>>> btrfs property set /btrfs readmirror "" >>>>>>>>>> btrfs property get /btrfs readmirror >>>>>>>>>> readmirror default >>>>>>>>>> >>>>>>>>>> This patchset has been tested completely, however marked as RFC for the >>>>>>>>>> following reasons and comments on them (or any other) are appreciated as >>>>>>>>>> usual. >>>>>>>>>> . The new objectid is defined as >>>>>>>>>> #define BTRFS_READMIRROR_OBJECTID -1ULL >>>>>>>>>> Need consent we are fine to use this value, and with this value it >>>>>>>>>> shall be placed just before the DEV_STATS_OBJECTID item which is more >>>>>>>>>> frequently used only during the device errors. >>>>>>>>>> >>>>>>>>>> . I am using a u64 bitmap to represent the devices id, so the max device >>>>>>>>>> id that we could represent is 63, its a kind of limitation which should >>>>>>>>>> be addressed before integration, I wonder if there is any suggestion? >>>>>>>>>> Kindly note that, multiple ioctls with each time representing a set of >>>>>>>>>> device(s) is not a choice because we need to make sure the readmirror >>>>>>>>>> changes happens in a commit transaction. >>>>>>>>>> >>>>>>>>>> v1->RFC v2: >>>>>>>>>> . Property is stored as a dev-tree item instead of root inode extended >>>>>>>>>> attribute. >>>>>>>>>> . Rename BTRFS_DEV_STATE_READ_OPRIMIZED to BTRFS_DEV_STATE_READ_PREFERRED. >>>>>>>>>> . Changed format specifier from devid1,2,3.. to devid:1,2,3.. >>>>>>>>>> >>>>>>>>>> RFC->v1: >>>>>>>>>> Drops pid as one of the readmirror policy choices and as usual remains >>>>>>>>>> as default. And when the devid is reset the readmirror policy falls back >>>>>>>>>> to pid. >>>>>>>>>> Drops the mount -o readmirror idea, it can be added at a later point of >>>>>>>>>> time. >>>>>>>>>> Property now accepts more than 1 devid as readmirror device. As shown >>>>>>>>>> in the example above. >>>>>>>>>> >>>>>>>>> >>>>>>>>> This is a lot of infrastructure >>>>>> >>>>>> Ok. Any idea on a better implementation? >>>>>> How about extended attribute approach? v1 patches proposed >>>>>> it, but it abused the extended attribute as commented here [1] >>>>>> and v2 got changed to an item-key. >>>>>> >>>>>> [1] >>>>>> https://lore.kernel.org/linux-btrfs/be68e6ea-00bc-b750-25e1-9c584b99308f@gmx.com/ >>>>>> >>>>> >>>>> That's a NAK on the prop interface. This is a fs wide policy, not a >>>>> directory/inode policy. >>>>> >>>>>> >>>>>>>>> to just change which mirror we read to based on >>>>>>>>> some arbitrary user policy. I assume this is to solve the case where you have >>>>>>>>> slow and fast disks, so you can always read from the fast disk? And then it's >>>>>>>>> only used in RAID1, so the very narrow usecase of having a RAID1 setup with a >>>>>>>>> SSD and a normal disk? I'm not seeing a point to this much code for one >>>>>>>>> particular obscure setup. Thanks, >>>>>>>>> >>>>>>>>> Josef >>>>>>>> >>>>>>>> Not commenting on the code itself, but as a user I see this SSD RAID1 >>>>>>>> acceleration as a future much have feature. It's only obscure at the >>>>>>>> moment because we don't have code to take advantage of it. But on >>>>>>>> large btrfs filesystems with hundreds of GB of metadata, like I have >>>>>>>> for backups, the usability of the filesystem is dramatically improved >>>>>>>> having the metadata on an SSD( though currently only half of the time >>>>>>>> due to the even/odd pid distribution.) >>>>>>> >>>>>>> But that's different from a mirror. 100% it would be nice to say "put my >>>>>>> metadata on the ssd, data elsewhere". That's not what this patch is about, this >>>>>>> patch is specifically about changing which drive we choose in a mirrored setup, >>>>>>> which is super unlikely to mirror a SSD with a slow drive, cause it's just going >>>>>>> to be slow no matter what. Sure we could make it so reads always go to the SSD, >>>>>>> but we can accomplish that by just adding a check for nonrotational in the code, >>>>>>> and then we don't have to encode all this nonsense in the file system. Thanks, >>>>>> >>>>>> I wrote about the readmirror policy framework here[2], >>>>>> I forgot to link it here, sorry about that, my mistake. >>>>>> >>>>>> [2] >>>>>> >>>>>> https://lore.kernel.org/linux-btrfs/1552989624-29577-1-git-send-email-anand.jain@oracle.com/ >>>>>> >>>>>> Readmirror policy is for raid1, raid10 and future N way mirror. >>>>>> Yes for now its only for raid1. >>>>>> >>>>>> Here the idea is to create a framework so that readmirror policy >>>>>> can be configured as needed. And nonrotational can be one such policy. >>>>>> >>>>>> The example of hard-coded nonrotational policy does not work in case >>>>>> of ssd and a remote iscsi ssd, OR in case of local ssd and a NVME block >>>>>> device, as all these are still nonrotational devices. So hard-coded >>>>>> policy is not a good idea. If we have to hardcode then there is Q-depth >>>>>> based readmirror routing is better (patch in the ML), but that is >>>>>> not good enough, because some configs wants it based on the disk-LBA >>>>>> so that SAN storage target cache is balanced and not duplicated. >>>>>> So in short it must be a configurable policy. >>>>>> >>>>> >>>>> Again, if you are mixing disk types you likely always want non-rotational, but >>>>> still mixing different speed devices in a mirror setup is just asking for weird >>>>> latency problems. I don't think solving this use case is necessary. If you mix >>>>> ssd + network device in a serious production setup then you probably should be >>>>> fired cause you don't know what you are doing. Having the generic >>>>> "nonrotational gets priority" is going to cover 99% of the actual use cases that >>>>> make sense. >>>>> >>>>> The SAN usecase I can sort of see, but again I don't feel like it's a problem we >>>>> need to solve with on-disk format. Add a priority to sysfs so you can change it >>>>> with udev or something on the fly. Thanks, >>>>> >>>> >>>> Ok. >>>> Sysfs is fine however we need configuration to be persistent across reboots. >>>> Any idea? >>>> >>> >>> Udev rules. Thanks, >>> >> >> Josef, configs moving along with the luns in san seems to be more >> easy to manage, otherwise when the host fails, each potential new >> server has to be pre-configured with the udev rules. >> > > It's 2019, if people haven't figured out how to do persistent configuration by > now then all hope is lost. Facebook persistently configures millions of > machines, I'm sure people can figure out how to make sure a udev rule ends up on > the right host connected to a SAN that doesn't move. Thanks, > > Josef > yeah. sysfs interface should be fine for now, and based on the usage feedback we could consider persistent storage for the readmirror. Typical to WORM (Write Once Read Many times) setups they mainly need good read performance as data is imported once from somewhere else. Just out of curiosity ran read performance checks using the readmirror property on ssd and nvme device. $ btrfs fi show Label: none uuid: e24bd8b5-a9e9-41d4-b28e-e95c0c545466 Total devices 2 FS bytes used 391.64MiB devid 1 size 447.13GiB used 2.01GiB path /dev/sdf devid 2 size 5.82TiB used 2.01GiB path /dev/nvme0n1 $ cat /sys/block/sdf/queue/rotational 0 $ cat /sys/block/nvme0n1/queue/rotational 0 $ dropcache; sleep 3; btrfs prop set /btrfs readmirror devid:2 && time md5sum /btrfs/fOH2 6129ccf74f7a761e0c3e096e051ba7a2 /btrfs/fOH2 real 0m0.725s user 0m0.604s sys 0m0.113s $ dropcache; sleep 3; btrfs prop set /btrfs readmirror devid:1 && time md5sum /btrfs/fOH2 6129ccf74f7a761e0c3e096e051ba7a2 /btrfs/fOH2 real 0m1.125s user 0m0.643s sys 0m0.087s nvme provides ~40% improvement for 400mb read io, however there are no other IOs. The question will be what kind of read io load balancer will be required at the full throttle of the preferred readmirror device IO Q depth [1]. Which means a heuristic is required to juggle around the policies and will be another readmirror policy to be part of the same framework when required. [1] $ cat /sys/block/nvme0n1/queue/nr_requests 1023 $ cat /sys/block/sdf/queue/nr_requests 64 Thanks, Anand
On 9/16/19 4:19 PM, Anand Jain wrote: > On 12/9/19 6:13 PM, Josef Bacik wrote: >> On Thu, Sep 12, 2019 at 06:10:08PM +0800, Anand Jain wrote: >>> >>> >>>> On 12 Sep 2019, at 6:03 PM, Josef Bacik <josef@toxicpanda.com> wrote: >>>> >>>> On Thu, Sep 12, 2019 at 06:00:21PM +0800, Anand Jain wrote: >>>>> >>>>> >>>>>> On 12 Sep 2019, at 5:50 PM, Josef Bacik <josef@toxicpanda.com> wrote: >>>>>> >>>>>> On Thu, Sep 12, 2019 at 03:41:42PM +0800, Anand Jain wrote: >>>>>>> >>>>>>> >>>>>>> Thanks for the comments. More below. >>>>>>> >>>>>>> On 12/9/19 3:16 AM, Josef Bacik wrote: >>>>>>>> On Wed, Sep 11, 2019 at 03:13:21PM -0400, Eli V wrote: >>>>>>>>> On Wed, Sep 11, 2019 at 2:46 PM Josef Bacik >>>>>>>>> <josef@toxicpanda.com> wrote: >>>>>>>>>> >>>>>>>>>> On Mon, Aug 26, 2019 at 05:04:36PM +0800, Anand Jain wrote: >>>>>>>>>>> Function call chain __btrfs_map_block()->find_live_mirror() >>>>>>>>>>> uses >>>>>>>>>>> thread pid to determine the %mirror_num when the mirror_num=0. >>>>>>>>>>> >>>>>>>>>>> This patch introduces a framework so that we can add policies >>>>>>>>>>> to determine >>>>>>>>>>> the %mirror_num. And also adds the devid as the readmirror >>>>>>>>>>> policy. >>>>>>>>>>> >>>>>>>>>>> The new property is stored as an item in the device tree as >>>>>>>>>>> show below. >>>>>>>>>>> (BTRFS_READMIRROR_OBJECTID, BTRFS_PERSISTENT_ITEM_KEY, >>>>>>>>>>> devid) >>>>>>>>>>> >>>>>>>>>>> To be able to set and get this new property also introduces >>>>>>>>>>> new ioctls >>>>>>>>>>> BTRFS_IOC_GET_READMIRROR and BTRFS_IOC_SET_READMIRROR. The >>>>>>>>>>> ioctl argument >>>>>>>>>>> is defined as >>>>>>>>>>> struct btrfs_ioctl_readmirror_args { >>>>>>>>>>> __u64 type; /* RW */ >>>>>>>>>>> __u64 device_bitmap; /* RW */ >>>>>>>>>>> } >>>>>>>>>>> >>>>>>>>>>> An usage example as follows: >>>>>>>>>>> btrfs property set /btrfs readmirror devid:1,3 >>>>>>>>>>> btrfs property get /btrfs readmirror >>>>>>>>>>> readmirror devid:1 3 >>>>>>>>>>> btrfs property set /btrfs readmirror "" >>>>>>>>>>> btrfs property get /btrfs readmirror >>>>>>>>>>> readmirror default >>>>>>>>>>> >>>>>>>>>>> This patchset has been tested completely, however marked as >>>>>>>>>>> RFC for the >>>>>>>>>>> following reasons and comments on them (or any other) are >>>>>>>>>>> appreciated as >>>>>>>>>>> usual. >>>>>>>>>>> . The new objectid is defined as >>>>>>>>>>> #define BTRFS_READMIRROR_OBJECTID -1ULL >>>>>>>>>>> Need consent we are fine to use this value, and with this >>>>>>>>>>> value it >>>>>>>>>>> shall be placed just before the DEV_STATS_OBJECTID item >>>>>>>>>>> which is more >>>>>>>>>>> frequently used only during the device errors. >>>>>>>>>>> >>>>>>>>>>> . I am using a u64 bitmap to represent the devices id, so >>>>>>>>>>> the max device >>>>>>>>>>> id that we could represent is 63, its a kind of limitation >>>>>>>>>>> which should >>>>>>>>>>> be addressed before integration, I wonder if there is any >>>>>>>>>>> suggestion? >>>>>>>>>>> Kindly note that, multiple ioctls with each time >>>>>>>>>>> representing a set of >>>>>>>>>>> device(s) is not a choice because we need to make sure the >>>>>>>>>>> readmirror >>>>>>>>>>> changes happens in a commit transaction. >>>>>>>>>>> >>>>>>>>>>> v1->RFC v2: >>>>>>>>>>> . Property is stored as a dev-tree item instead of root >>>>>>>>>>> inode extended >>>>>>>>>>> attribute. >>>>>>>>>>> . Rename BTRFS_DEV_STATE_READ_OPRIMIZED to >>>>>>>>>>> BTRFS_DEV_STATE_READ_PREFERRED. >>>>>>>>>>> . Changed format specifier from devid1,2,3.. to devid:1,2,3.. >>>>>>>>>>> >>>>>>>>>>> RFC->v1: >>>>>>>>>>> Drops pid as one of the readmirror policy choices and as >>>>>>>>>>> usual remains >>>>>>>>>>> as default. And when the devid is reset the readmirror >>>>>>>>>>> policy falls back >>>>>>>>>>> to pid. >>>>>>>>>>> Drops the mount -o readmirror idea, it can be added at a >>>>>>>>>>> later point of >>>>>>>>>>> time. >>>>>>>>>>> Property now accepts more than 1 devid as readmirror >>>>>>>>>>> device. As shown >>>>>>>>>>> in the example above. >>>>>>>>>>> >>>>>>>>>> >>>>>>>>>> This is a lot of infrastructure >>>>>>> >>>>>>> Ok. Any idea on a better implementation? >>>>>>> How about extended attribute approach? v1 patches proposed >>>>>>> it, but it abused the extended attribute as commented here [1] >>>>>>> and v2 got changed to an item-key. >>>>>>> >>>>>>> [1] >>>>>>> https://lore.kernel.org/linux-btrfs/be68e6ea-00bc-b750-25e1-9c584b99308f@gmx.com/ >>>>>>> >>>>>>> >>>>>> >>>>>> That's a NAK on the prop interface. This is a fs wide policy, not a >>>>>> directory/inode policy. >>>>>> >>>>>>> >>>>>>>>>> to just change which mirror we read to based on >>>>>>>>>> some arbitrary user policy. I assume this is to solve the >>>>>>>>>> case where you have >>>>>>>>>> slow and fast disks, so you can always read from the fast >>>>>>>>>> disk? And then it's >>>>>>>>>> only used in RAID1, so the very narrow usecase of having a >>>>>>>>>> RAID1 setup with a >>>>>>>>>> SSD and a normal disk? I'm not seeing a point to this much >>>>>>>>>> code for one >>>>>>>>>> particular obscure setup. Thanks, >>>>>>>>>> >>>>>>>>>> Josef >>>>>>>>> >>>>>>>>> Not commenting on the code itself, but as a user I see this SSD >>>>>>>>> RAID1 >>>>>>>>> acceleration as a future much have feature. It's only obscure >>>>>>>>> at the >>>>>>>>> moment because we don't have code to take advantage of it. But on >>>>>>>>> large btrfs filesystems with hundreds of GB of metadata, like I >>>>>>>>> have >>>>>>>>> for backups, the usability of the filesystem is dramatically >>>>>>>>> improved >>>>>>>>> having the metadata on an SSD( though currently only half of >>>>>>>>> the time >>>>>>>>> due to the even/odd pid distribution.) >>>>>>>> >>>>>>>> But that's different from a mirror. 100% it would be nice to >>>>>>>> say "put my >>>>>>>> metadata on the ssd, data elsewhere". That's not what this >>>>>>>> patch is about, this >>>>>>>> patch is specifically about changing which drive we choose in a >>>>>>>> mirrored setup, >>>>>>>> which is super unlikely to mirror a SSD with a slow drive, cause >>>>>>>> it's just going >>>>>>>> to be slow no matter what. Sure we could make it so reads >>>>>>>> always go to the SSD, >>>>>>>> but we can accomplish that by just adding a check for >>>>>>>> nonrotational in the code, >>>>>>>> and then we don't have to encode all this nonsense in the file >>>>>>>> system. Thanks, >>>>>>> >>>>>>> I wrote about the readmirror policy framework here[2], >>>>>>> I forgot to link it here, sorry about that, my mistake. >>>>>>> >>>>>>> [2] >>>>>>> >>>>>>> https://lore.kernel.org/linux-btrfs/1552989624-29577-1-git-send-email-anand.jain@oracle.com/ >>>>>>> >>>>>>> >>>>>>> Readmirror policy is for raid1, raid10 and future N way mirror. >>>>>>> Yes for now its only for raid1. >>>>>>> >>>>>>> Here the idea is to create a framework so that readmirror policy >>>>>>> can be configured as needed. And nonrotational can be one such >>>>>>> policy. >>>>>>> >>>>>>> The example of hard-coded nonrotational policy does not work in case >>>>>>> of ssd and a remote iscsi ssd, OR in case of local ssd and a NVME >>>>>>> block >>>>>>> device, as all these are still nonrotational devices. So hard-coded >>>>>>> policy is not a good idea. If we have to hardcode then there is >>>>>>> Q-depth >>>>>>> based readmirror routing is better (patch in the ML), but that is >>>>>>> not good enough, because some configs wants it based on the disk-LBA >>>>>>> so that SAN storage target cache is balanced and not duplicated. >>>>>>> So in short it must be a configurable policy. >>>>>>> >>>>>> >>>>>> Again, if you are mixing disk types you likely always want >>>>>> non-rotational, but >>>>>> still mixing different speed devices in a mirror setup is just >>>>>> asking for weird >>>>>> latency problems. I don't think solving this use case is >>>>>> necessary. If you mix >>>>>> ssd + network device in a serious production setup then you >>>>>> probably should be >>>>>> fired cause you don't know what you are doing. Having the generic >>>>>> "nonrotational gets priority" is going to cover 99% of the actual >>>>>> use cases that >>>>>> make sense. >>>>>> >>>>>> The SAN usecase I can sort of see, but again I don't feel like >>>>>> it's a problem we >>>>>> need to solve with on-disk format. Add a priority to sysfs so you >>>>>> can change it >>>>>> with udev or something on the fly. Thanks, >>>>>> >>>>> >>>>> Ok. >>>>> Sysfs is fine however we need configuration to be persistent across >>>>> reboots. >>>>> Any idea? >>>>> >>>> >>>> Udev rules. Thanks, >>>> >>> >>> Josef, configs moving along with the luns in san seems to be more >>> easy to manage, otherwise when the host fails, each potential new >>> server has to be pre-configured with the udev rules. >>> >> >> It's 2019, if people haven't figured out how to do persistent >> configuration by >> now then all hope is lost. Facebook persistently configures millions of >> machines, I'm sure people can figure out how to make sure a udev rule >> ends up on >> the right host connected to a SAN that doesn't move. Thanks, >> >> Josef >> > > yeah. sysfs interface should be fine for now, and based on the > usage feedback we could consider persistent storage for the readmirror. > > Typical to WORM (Write Once Read Many times) setups they mainly need > good read performance as data is imported once from somewhere else. > Just out of curiosity ran read performance checks using the readmirror > property on ssd and nvme device. > > $ btrfs fi show > Label: none uuid: e24bd8b5-a9e9-41d4-b28e-e95c0c545466 > Total devices 2 FS bytes used 391.64MiB > devid 1 size 447.13GiB used 2.01GiB path /dev/sdf > devid 2 size 5.82TiB used 2.01GiB path /dev/nvme0n1 > > $ cat /sys/block/sdf/queue/rotational > 0 > $ cat /sys/block/nvme0n1/queue/rotational > 0 > > $ dropcache; sleep 3; btrfs prop set /btrfs readmirror devid:2 && time > md5sum /btrfs/fOH2 > 6129ccf74f7a761e0c3e096e051ba7a2 /btrfs/fOH2 > > real 0m0.725s > user 0m0.604s > sys 0m0.113s > > $ dropcache; sleep 3; btrfs prop set /btrfs readmirror devid:1 && time > md5sum /btrfs/fOH2 > 6129ccf74f7a761e0c3e096e051ba7a2 /btrfs/fOH2 > > real 0m1.125s > user 0m0.643s > sys 0m0.087s > > nvme provides ~40% improvement for 400mb read io, however there are no > other IOs. The question will be what kind of read io load balancer will > be required at the full throttle of the preferred readmirror > device IO Q depth [1]. Which means a heuristic is required to juggle > around the policies and will be another readmirror policy to be part > of the same framework when required. > > [1] > $ cat /sys/block/nvme0n1/queue/nr_requests > 1023 > $ cat /sys/block/sdf/queue/nr_requests > 64 > > Thanks, Anand Josef, Here I infer as follows .. I wonder if it makes sense. - Potential readmirror policies are: pid, devid/manual, lba/cache, Q-depth(patches are in the ml from facebook) each one of them has their own advantage so a framework is essential. - As the block device performance keeps evolving so its common to have heterogeneous block devices on the host. readmirror policy helps to configure for the best possible performance. - sysfs is a good interface for readmirror policy as in ioctl we can't pass all possible devices in the ioctl argument. - ondisk storage is essential for the readmirror, and key-item pair is the only way for its storage. (On the 2nd thought storage cannot be implemented at a later point as sysfs interface must be consistent in each version. So implementing together sysfs and ondisk key-item pair makes sense to me.) Thanks, Anand