Message ID: 20210812075015.1090959-1-damien.lemoal@wdc.com (mailing list archive)
Series: Initial support for multi-actuator HDDs
On 8/12/21 1:50 AM, Damien Le Moal wrote:
> Single LUN multi-actuator hard-disks are capable of seeking and executing
> multiple commands in parallel. This capability is exposed to the host
> using the Concurrent Positioning Ranges VPD page (SCSI) and Log (ATA).
> Each positioning range describes the contiguous set of LBAs that an
> actuator serves.
>
> This series adds support to the scsi disk driver to retrieve this
> information and advertise it to user space through sysfs. libata is also
> modified to handle ATA drives.
>
> The first patch adds the block layer plumbing to expose concurrent
> sector ranges of the device through sysfs as a sub-directory of the
> device sysfs queue directory. Patches 2 and 3 add support to sd and
> libata. Finally, patch 4 documents the sysfs queue attribute changes.
> Patch 5 fixes a typo in the documentation file (strictly speaking, not
> related to this series).
>
> This series does not attempt in any way to optimize accesses to
> multi-actuator devices (e.g. block IO scheduler or filesystems). This
> initial support only exposes the actuator information to user space
> through sysfs.

This looks good to me now - are we taking this through the block tree?
If so, would be nice to get a SCSI signoff on patch 2.
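For context, here is a minimal, self-contained C sketch of the per-actuator information the Concurrent Positioning Ranges VPD page/log conveys, assuming a simple dual-actuator layout. The struct, field, and function names are invented for illustration and are not the kernel interface added by this series.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Illustrative only: one entry per actuator, describing the contiguous
 * span of LBAs that actuator serves.  Placeholder names, not the
 * series' API. */
struct actuator_range {
	uint64_t start_lba;
	uint64_t nr_lbas;
};

/* The kind of sanity check a driver would apply: the reported ranges
 * must be contiguous and together cover the whole device capacity. */
static bool ranges_cover_device(const struct actuator_range *r, int nr,
				uint64_t capacity)
{
	uint64_t next = 0;

	for (int i = 0; i < nr; i++) {
		if (r[i].start_lba != next)
			return false;
		next += r[i].nr_lbas;
	}
	return next == capacity;
}

int main(void)
{
	/* A dual-actuator drive: each actuator serves half of the LBAs. */
	const struct actuator_range r[2] = {
		{ .start_lba = 0,              .nr_lbas = 17578328064ULL },
		{ .start_lba = 17578328064ULL, .nr_lbas = 17578328064ULL },
	};

	printf("layout valid: %d\n",
	       ranges_cover_device(r, 2, 2 * 17578328064ULL));
	return 0;
}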
Jens,

> This looks good to me now - are we taking this through the block tree?
> If so, would be nice to get a SCSI signoff on patch 2.

I'd like to do one more review pass. Also, it may be best to take patch
2 through my tree since I suspect it will interfere with my VPD shuffle.
On 8/12/21 11:21 AM, Martin K. Petersen wrote:
>
> Jens,
>
>> This looks good to me now - are we taking this through the block tree?
>> If so, would be nice to get a SCSI signoff on patch 2.
>
> I'd like to do one more review pass. Also, it may be best to take patch
> 2 through my tree since I suspect it will interfere with my VPD shuffle.

That's fine, we can do it like that. I'll wait a day or two and, pending
any last minute changes, take 1,3-5 through block.
On 2021/08/13 2:25, Jens Axboe wrote:
> On 8/12/21 11:21 AM, Martin K. Petersen wrote:
>>
>> Jens,
>>
>>> This looks good to me now - are we taking this through the block tree?
>>> If so, would be nice to get a SCSI signoff on patch 2.
>>
>> I'd like to do one more review pass. Also, it may be best to take patch
>> 2 through my tree since I suspect it will interfere with my VPD shuffle.
>
> That's fine, we can do it like that. I'll wait a day or two and, pending
> any last minute changes, take 1,3-5 through block.

Jens, Martin,

Thanks. Without patch 1, patch 2 will cause compilation errors
(disk_alloc_cranges() and disk_register_cranges() will be undefined).
So maybe both patches 1 & 2 should go through the scsi tree ?
Hi Damien!

> Single LUN multi-actuator hard-disks are capable of seeking and executing
> multiple commands in parallel. This capability is exposed to the host
> using the Concurrent Positioning Ranges VPD page (SCSI) and Log (ATA).
> Each positioning range describes the contiguous set of LBAs that an
> actuator serves.

I am not a big fan of the Concurrent Positioning Range terminology since
it is very specific to the implementation of multi-actuator disk drives.
With other types of media, "positioning" doesn't make any sense. It is
unfortunate that CPR is the term that ended up in a spec that covers a
wide variety of devices and media types.

I also think that the "concurrent positioning" emphasizes the
performance aspect but not so much the fault domain which in many ways
is the more interesting part.

The purpose of exposing this information to the filesystems must be to
encourage them to use it. And therefore I think it is important that the
semantics and information conveyed is applicable outside of the
multi-actuator use case. It would be easy to expose this kind of
information for concatenated LVM devices, etc.

Anyway. I don't really have any objections to the series from an
implementation perspective. I do think "cpr" as you used in patch #2 is
a better name than "crange". But again, I wish we could come up with a
more accurate and less disk actuator centric terminology for the block
layer plumbing.

I would have voted for "fault_domain" but that ignores the performance
aspect. "independent_block_range", maybe? Why is naming always so hard?
:(
Damien,

> Without patch 1, patch 2 will cause compilation errors
> (disk_alloc_cranges() and disk_register_cranges() will be
> undefined). So maybe both patches 1 & 2 should go through the scsi
> tree ?

The sd patch would go into -rc2.
On 2021/08/17 2:10, Martin K. Petersen wrote:
>
> Hi Damien!
>
>> Single LUN multi-actuator hard-disks are capable of seeking and executing
>> multiple commands in parallel. This capability is exposed to the host
>> using the Concurrent Positioning Ranges VPD page (SCSI) and Log (ATA).
>> Each positioning range describes the contiguous set of LBAs that an
>> actuator serves.
>
> I am not a big fan of the Concurrent Positioning Range terminology since
> it is very specific to the implementation of multi-actuator disk drives.
> With other types of media, "positioning" doesn't make any sense. It is
> unfortunate that CPR is the term that ended up in a spec that covers a
> wide variety of devices and media types.
>
> I also think that the "concurrent positioning" emphasizes the
> performance aspect but not so much the fault domain which in many ways
> is the more interesting part.
>
> The purpose of exposing this information to the filesystems must be to
> encourage them to use it. And therefore I think it is important that the
> semantics and information conveyed is applicable outside of the
> multi-actuator use case. It would be easy to expose this kind of
> information for concatenated LVM devices, etc.
>
> Anyway. I don't really have any objections to the series from an
> implementation perspective. I do think "cpr" as you used in patch #2 is
> a better name than "crange". But again, I wish we could come up with a
> more accurate and less disk actuator centric terminology for the block
> layer plumbing.
>
> I would have voted for "fault_domain" but that ignores the performance
> aspect. "independent_block_range", maybe? Why is naming always so hard?
> :(

I did struggle with the naming too, and crange was the best I could come
up with given the specs' wording.

With the single LUN approach, the fault domain does not really change
from a regular device. The typical use in case of bad heads would be
to replace the drive or reformat it at lower capacity with head
depop. That could be avoided with dm-linear on top (one DM per
actuator) though.

As for the independent_block_range idea, I thought of that too, but the
problem is that the tags are still shared between the 2 actuators, so
the ranges are not really independent at all. One can starve the other
depending on the host workload, without FS and/or IO scheduler
optimization distributing the commands between the actuators.

The above point led me to this informational only implementation.
Without optimization, we get the same as today. No changes in
performance and use. Better IOPS is a gain for lucky workloads
(typically random ones). Going forward, more reliable IOPS &
throughput gains are possible with some additional changes.
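To make the scheduling point concrete, below is a minimal sketch of the lookup a future I/O scheduler or filesystem optimization would need: mapping a request's starting sector to the actuator serving it. The struct and function names are invented for illustration and are not part of the posted series.

#include <stdint.h>
#include <stdio.h>

/* Hypothetical range descriptor: start sector and length, mirroring the
 * sector/nr_sectors pair discussed later in this thread. */
struct example_range {
	uint64_t sector;
	uint64_t nr_sectors;
};

/* Map a request's starting sector to the actuator (range index) serving
 * it.  Ranges are assumed sorted and non-overlapping.  Returns -1 if
 * the sector falls outside every range. */
static int sector_to_range(const struct example_range *r, int nr,
			   uint64_t sector)
{
	for (int i = 0; i < nr; i++) {
		if (sector >= r[i].sector &&
		    sector - r[i].sector < r[i].nr_sectors)
			return i;
	}
	return -1;
}

int main(void)
{
	const struct example_range r[2] = {
		{ .sector = 0,       .nr_sectors = 1000000 },
		{ .sector = 1000000, .nr_sectors = 1000000 },
	};

	printf("sector 42 -> range %d\n", sector_to_range(r, 2, 42));
	printf("sector 1500000 -> range %d\n", sector_to_range(r, 2, 1500000));
	return 0;
}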
On 2021/08/17 2:12, Martin K. Petersen wrote:
>
> Damien,
>
>> Without patch 1, patch 2 will cause compilation errors
>> (disk_alloc_cranges() and disk_register_cranges() will be
>> undefined). So maybe both patches 1 & 2 should go through the scsi
>> tree ?
>
> The sd patch would go into -rc2.

Makes sense. Thanks !
Damien,

> With the single LUN approach, the fault domain does not really change
> from a regular device. The typical use in case of bad heads would be
> to replace the drive or reformat it at lower capacity with head
> depop. That could be avoided with dm-linear on top (one DM per
> actuator) though.

I was more thinking along the lines of btrfs making sure to place
dup metadata on different actuators or similar.

> The above point led me to this informational only implementation.
> Without optimization, we get the same as today. No changes in
> performance and use. Better IOPS is a gain for lucky workloads
> (typically random ones). Going forward, more reliable IOPS &
> throughput gains are possible with some additional changes.

Sure, but that means the ranges need to affect both I/O scheduling and
data placement. We need to make sure that the data placement interface
semantics are applicable to other types of media than multi-actuator
drives. Filesystems shouldn't have to look at different queue limits if
they sit on top of dm-linear instead of sd. From an I/O scheduling
perspective I concur that device implementation details are pertinent.
On 2021/08/18 11:24, Martin K. Petersen wrote:
>
> Damien,
>
>> With the single LUN approach, the fault domain does not really change
>> from a regular device. The typical use in case of bad heads would be
>> to replace the drive or reformat it at lower capacity with head
>> depop. That could be avoided with dm-linear on top (one DM per
>> actuator) though.
>
> I was more thinking along the lines of btrfs making sure to place
> dup metadata on different actuators or similar.
>
>> The above point led me to this informational only implementation.
>> Without optimization, we get the same as today. No changes in
>> performance and use. Better IOPS is a gain for lucky workloads
>> (typically random ones). Going forward, more reliable IOPS &
>> throughput gains are possible with some additional changes.
>
> Sure, but that means the ranges need to affect both I/O scheduling and
> data placement. We need to make sure that the data placement interface
> semantics are applicable to other types of media than multi-actuator
> drives. Filesystems shouldn't have to look at different queue limits if
> they sit on top of dm-linear instead of sd. From an I/O scheduling
> perspective I concur that device implementation details are pertinent.

Currently, FSes cannot know the dm-linear config. The queue crange
interface can actually unify this for split/dual-actuator and dm-linear
like logical disks. I need to send patches to dm-devel for that though,
as currently dm-linear does not expose its config as cranges.

If I recall correctly, Hannes was also interested in playing with that
too :)
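The unification idea mentioned above can be sketched very simply: each dm-linear target's span of the logical device would become one advertised range, just as each actuator's span does on a dual-actuator drive. The structures and function below are illustrative only and are not device-mapper code.

#include <stdint.h>
#include <stdio.h>

/* Hypothetical, simplified view of a dm-linear table entry: the span of
 * the logical device it maps.  Real dm tables carry more information. */
struct linear_target {
	uint64_t logical_start;	/* first sector in the mapped device */
	uint64_t len;		/* length in sectors */
};

/* Same shape as the ranges a dual-actuator disk would report. */
struct example_range {
	uint64_t sector;
	uint64_t nr_sectors;
};

/* One advertised range per dm-linear target: a sketch of the idea only. */
static void targets_to_ranges(const struct linear_target *t, int nr,
			      struct example_range *out)
{
	for (int i = 0; i < nr; i++) {
		out[i].sector = t[i].logical_start;
		out[i].nr_sectors = t[i].len;
	}
}

int main(void)
{
	const struct linear_target t[2] = {
		{ .logical_start = 0,      .len = 500000 },
		{ .logical_start = 500000, .len = 500000 },
	};
	struct example_range r[2];

	targets_to_ranges(t, 2, r);
	for (int i = 0; i < 2; i++)
		printf("range %d: sector=%llu nr_sectors=%llu\n", i,
		       (unsigned long long)r[i].sector,
		       (unsigned long long)r[i].nr_sectors);
	return 0;
}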
On 2021/08/12 16:50, Damien Le Moal wrote:
> Single LUN multi-actuator hard-disks are capable of seeking and executing
> multiple commands in parallel. This capability is exposed to the host
> using the Concurrent Positioning Ranges VPD page (SCSI) and Log (ATA).
> Each positioning range describes the contiguous set of LBAs that an
> actuator serves.
>
> This series adds support to the scsi disk driver to retrieve this
> information and advertise it to user space through sysfs. libata is also
> modified to handle ATA drives.
>
> The first patch adds the block layer plumbing to expose concurrent
> sector ranges of the device through sysfs as a sub-directory of the
> device sysfs queue directory. Patches 2 and 3 add support to sd and
> libata. Finally, patch 4 documents the sysfs queue attribute changes.
> Patch 5 fixes a typo in the documentation file (strictly speaking, not
> related to this series).
>
> This series does not attempt in any way to optimize accesses to
> multi-actuator devices (e.g. block IO scheduler or filesystems). This
> initial support only exposes the actuator information to user space
> through sysfs.

Hi Jens,

I do not see this queued in for-5.15/block. Anything holding this
series up ?

Thanks !
On 2021/08/12 16:50, Damien Le Moal wrote:
> Single LUN multi-actuator hard-disks are capable of seeking and executing
> multiple commands in parallel. This capability is exposed to the host
> using the Concurrent Positioning Ranges VPD page (SCSI) and Log (ATA).
> Each positioning range describes the contiguous set of LBAs that an
> actuator serves.
>
> This series adds support to the scsi disk driver to retrieve this
> information and advertise it to user space through sysfs. libata is also
> modified to handle ATA drives.
>
> The first patch adds the block layer plumbing to expose concurrent
> sector ranges of the device through sysfs as a sub-directory of the
> device sysfs queue directory. Patches 2 and 3 add support to sd and
> libata. Finally, patch 4 documents the sysfs queue attribute changes.
> Patch 5 fixes a typo in the documentation file (strictly speaking, not
> related to this series).
>
> This series does not attempt in any way to optimize accesses to
> multi-actuator devices (e.g. block IO scheduler or filesystems). This
> initial support only exposes the actuator information to user space
> through sysfs.

Jens,

Re-ping ? Anything against this series ? Can we get it queued for 5.15 ?

>
> Changes from v4:
> * Fixed kdoc comment function name mismatch for disk_register_cranges()
>   in patch 1
>
> Changes from v3:
> * Modified patch 1:
>   - Prefix functions that take a struct gendisk as argument with
>     "disk_". Modified patch 2 accordingly.
>   - Added a functional release operation for struct blk_cranges kobj to
>     ensure that this structure is freed only after all references to it
>     are released, including kobject_del() execution for all crange sysfs
>     entries.
> * Added patch 5 to separate the typo fix from the crange documentation
>   addition.
> * Added Reviewed-by tags
>
> Changes from v2:
> * Updated patch 1 to fix a compilation warning for a potential NULL
>   pointer dereference of the cr argument of blk_queue_set_cranges().
>   Warning reported by the kernel test robot <lkp@intel.com>.
>
> Changes from v1:
> * Moved libata-scsi hunk from patch 1 to patch 3 where it belongs
> * Fixed uninitialized variable in patch 2
>   Reported-by: kernel test robot <lkp@intel.com>
>   Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
> * Changed patch 3: struct ata_cpr_log now contains both the number
>   of concurrent ranges and the array of concurrent ranges.
> * Added a note in the documentation (patch 4) about the unit used for
>   the concurrent ranges attributes.
>
> Damien Le Moal (5):
>   block: Add concurrent positioning ranges support
>   scsi: sd: add concurrent positioning ranges support
>   libata: support concurrent positioning ranges log
>   doc: document sysfs queue/cranges attributes
>   doc: Fix typo in request queue sysfs documentation
>
>  Documentation/block/queue-sysfs.rst |  30 ++-
>  block/Makefile                      |   2 +-
>  block/blk-cranges.c                 | 310 ++++++++++++++++++++++++++++
>  block/blk-sysfs.c                   |  26 ++-
>  block/blk.h                         |   4 +
>  drivers/ata/libata-core.c           |  52 +++++
>  drivers/ata/libata-scsi.c           |  48 ++++-
>  drivers/scsi/sd.c                   |  81 ++++++++
>  drivers/scsi/sd.h                   |   1 +
>  include/linux/ata.h                 |   1 +
>  include/linux/blkdev.h              |  29 +++
>  include/linux/libata.h              |  15 ++
>  12 files changed, 580 insertions(+), 19 deletions(-)
>  create mode 100644 block/blk-cranges.c
>
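As a rough idea of how user space would consume the interface documented in patch 4, here is a small C program walking the per-range sysfs directories. The directory name (queue/cranges) follows the patch title, and the sector/nr_sectors attribute names are assumed from the discussion later in this thread; check the merged documentation for the final names.

#include <stdio.h>

int main(int argc, char **argv)
{
	const char *disk = argc > 1 ? argv[1] : "sda";
	char path[256];

	/* Walk /sys/block/<disk>/queue/cranges/<n>/ until no more ranges. */
	for (int i = 0; ; i++) {
		unsigned long long sector, nr_sectors;
		FILE *f;

		snprintf(path, sizeof(path),
			 "/sys/block/%s/queue/cranges/%d/sector", disk, i);
		f = fopen(path, "r");
		if (!f)
			break;	/* no more ranges */
		if (fscanf(f, "%llu", &sector) != 1)
			sector = 0;
		fclose(f);

		snprintf(path, sizeof(path),
			 "/sys/block/%s/queue/cranges/%d/nr_sectors", disk, i);
		f = fopen(path, "r");
		if (!f)
			break;
		if (fscanf(f, "%llu", &nr_sectors) != 1)
			nr_sectors = 0;
		fclose(f);

		printf("range %d: sector=%llu nr_sectors=%llu\n",
		       i, sector, nr_sectors);
	}
	return 0;
}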
Damien,
> Re-ping ? Anything against this series ? Can we get it queued for 5.15 ?
Wrt. the choice of 'crange':
https://lore.kernel.org/linux-fsdevel/CAHk-=wiZ=wwa4oAA0y=Kztafgp0n+BDTEV6ybLoH2nvLBeJBLA@mail.gmail.com/
I share Linus' sentiment.
I really think 'crange' is a horribly non-descriptive name compared to
logical_block_size, max_sectors_kb, discard_max_bytes, and the other
things we export.
In addition, the recently posted copy offload patches also used
'crange', in that context to describe a 'copy range'.
Anyway. Just my opinion.
Jens: Feel free to add my Acked-by: to the sd pieces. My SCSI discovery
rework won't make 5.15.
On 2021/08/26 11:42, Martin K. Petersen wrote:
>
> Damien,
>
>> Re-ping ? Anything against this series ? Can we get it queued for 5.15 ?
>
> Wrt. the choice of 'crange':
>
> https://lore.kernel.org/linux-fsdevel/CAHk-=wiZ=wwa4oAA0y=Kztafgp0n+BDTEV6ybLoH2nvLBeJBLA@mail.gmail.com/

Yes, I have been reading this thread. Very good one.

>
> I share Linus' sentiment.
>
> I really think 'crange' is a horribly non-descriptive name compared to
> logical_block_size, max_sectors_kb, discard_max_bytes, and the other
> things we export.
>
> In addition, the recently posted copy offload patches also used
> 'crange', in that context to describe a 'copy range'.

Yes, saw that too.

> Anyway. Just my opinion.

Thanks for sharing. I am not super happy with the name either. I used
this one as the least worst of the possibilities I thought of.
seek_range/srange ? -> that is very HDD centric and as we can reuse this
for things like dm-linear on top of SSDs, that does not really work. I
would prefer something that conveys the idea of "parallel command
execution", since this is the main point of the interface. prange ?
cmd_range ? req_range ? Naming is really hard...

> Jens: Feel free to add my Acked-by: to the sd pieces. My SCSI discovery
> rework won't make 5.15.
>
Damien,

> I am not super happy with the name either. I used this one as the
> least worst of the possibilities I thought of. seek_range/srange ? ->
> that is very HDD centric and as we can reuse this for things like
> dm-linear on top of SSDs, that does not really work. I would prefer
> something that conveys the idea of "parallel command execution", since
> this is the main point of the interface. prange ? cmd_range ?
> req_range ?

How about independent_access_range? That doesn't imply head positioning
and can also be used to describe a fault domain. And it is less
disk-centric than concurrent_positioning_range.

I concur that those names are a bit unwieldy but at least they are
somewhat descriptive.

I consulted the thesaurus and didn't really like the other options
(discrete, disjoint, separate, etc.). I think 'independent' is more
accurate for this and better than 'concurrent' and 'parallel'.
On 2021/08/26 12:43, Martin K. Petersen wrote:
>
> Damien,
>
>> I am not super happy with the name either. I used this one as the
>> least worst of the possibilities I thought of. seek_range/srange ? ->
>> that is very HDD centric and as we can reuse this for things like
>> dm-linear on top of SSDs, that does not really work. I would prefer
>> something that conveys the idea of "parallel command execution", since
>> this is the main point of the interface. prange ? cmd_range ?
>> req_range ?
>
> How about independent_access_range? That doesn't imply head positioning
> and can also be used to describe a fault domain. And it is less
> disk-centric than concurrent_positioning_range.

I like it, but a bit long-ish. Do you think shortening to access_range
would be acceptable ? We would have:

/sys/block/XYZ/queue/access_ranges/...

and

struct blk_access_range {
        ...
        sector_t        sector;
        sector_t        nr_sectors;
};

struct blk_access_range *arange;

Adding independent does make everything even more obvious, but names
become rather long. Not an issue for the sysfs directory I think, but

struct blk_independent_access_range {
        ...
        sector_t        sector;
        sector_t        nr_sectors;
};

is rather a long struct name. Shortening independent to "ind" could
very easily be confused with "indirection", so that is not an option.
And "iaccess" is obscure...

> I concur that those names are a bit unwieldy but at least they are
> somewhat descriptive.
>
> I consulted the thesaurus and didn't really like the other options
> (discrete, disjoint, separate, etc.). I think 'independent' is more
> accurate for this and better than 'concurrent' and 'parallel'.
>
On 8/26/21 5:50 AM, Damien Le Moal wrote:
> On 2021/08/26 12:43, Martin K. Petersen wrote:
>>
>> Damien,
>>
>>> I am not super happy with the name either. I used this one as the
>>> least worst of the possibilities I thought of. seek_range/srange ? ->
>>> that is very HDD centric and as we can reuse this for things like
>>> dm-linear on top of SSDs, that does not really work. I would prefer
>>> something that conveys the idea of "parallel command execution", since
>>> this is the main point of the interface. prange ? cmd_range ?
>>> req_range ?
>>
>> How about independent_access_range? That doesn't imply head positioning
>> and can also be used to describe a fault domain. And it is less
>> disk-centric than concurrent_positioning_range.
>
> I like it, but a bit long-ish. Do you think shortening to access_range
> would be acceptable ? We would have:
>
> /sys/block/XYZ/queue/access_ranges/...
>
> and
>
> struct blk_access_range {
>         ...
>         sector_t        sector;
>         sector_t        nr_sectors;
> };
>
> struct blk_access_range *arange;
>
> Adding independent does make everything even more obvious, but names
> become rather long. Not an issue for the sysfs directory I think, but
>
> struct blk_independent_access_range {
>         ...
>         sector_t        sector;
>         sector_t        nr_sectors;
> };
>
> is rather a long struct name. Shortening independent to "ind" could
> very easily be confused with "indirection", so that is not an option.
> And "iaccess" is obscure...
>

I'd vote for access_range.

Cheers,

Hannes
Damien,

> I like it, but a bit long-ish. Do you think shortening to access_range
> would be acceptable ?

But doesn't 'access_range' imply that there are ranges that you can't
access? I think 'independent' is more important and 'access' is just a
clarification.

> Adding independent does make everything even more obvious, but names
> become rather long. Not an issue for the sysfs directory I think, but

I do think it's important that the sysfs directory in particular is the
full thing. It's a user-visible interface.

If the internal interfaces have a shorthand I guess that's OK.

> struct blk_independent_access_range {
>         ...
>         sector_t        sector;
>         sector_t        nr_sectors;
> };
>
> is rather a long struct name.

True, but presumably you'd do:

struct blk_independent_access_range *iar;

in a variable declaration and be done with it. So I don't think the type
is a big deal. Where it becomes unwieldy is:

blk_rq_independent_access_range_frobnicate();

Anyway. Running out of ideas. autonomous_range? sequestered_range?
On 2021/08/27 12:03, Martin K. Petersen wrote:
>
> Damien,
>
>> I like it, but a bit long-ish. Do you think shortening to access_range
>> would be acceptable ?
>
> But doesn't 'access_range' imply that there are ranges that you can't
> access? I think 'independent' is more important and 'access' is just a
> clarification.
>
>> Adding independent does make everything even more obvious, but names
>> become rather long. Not an issue for the sysfs directory I think, but
>
> I do think it's important that the sysfs directory in particular is the
> full thing. It's a user-visible interface.

Totally agree on that.

> If the internal interfaces have a shorthand I guess that's OK.
>
>> struct blk_independent_access_range {
>>         ...
>>         sector_t        sector;
>>         sector_t        nr_sectors;
>> };
>>
>> is rather a long struct name.
>
> True, but presumably you'd do:
>
> struct blk_independent_access_range *iar;
>
> in a variable declaration and be done with it. So I don't think the type
> is a big deal. Where it becomes unwieldy is:
>
> blk_rq_independent_access_range_frobnicate();

OK. Let me rework the patches with the full blk_independent_access_range
type name to see what everything becomes.

> Anyway. Running out of ideas. autonomous_range? sequestered_range?

Arg. I prefer independent over autonomous. And "sequestered" is used in
the context of SCSI/ATA command duration limits, so I would rather not
use that. I am out of ideas too. Let me try to see with "independent".
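For reference, here is a self-contained guess at what the renamed plumbing could look like if the "independent access range" naming discussed above is adopted. The field names follow the sector/nr_sectors pair sketched earlier in the thread; sector_t is typedef'd locally so the example builds standalone, and none of this is the final merged kernel API.

#include <stdint.h>
#include <stdio.h>

/* sector_t is a kernel type; typedef'd here so the sketch compiles
 * standalone. */
typedef uint64_t sector_t;

/* Hypothetical renamed structures. */
struct blk_independent_access_range {
	sector_t sector;	/* first sector of the range */
	sector_t nr_sectors;	/* number of sectors in the range */
};

struct blk_independent_access_ranges {
	unsigned int nr_ranges;
	struct blk_independent_access_range range[2]; /* fixed size for the demo */
};

int main(void)
{
	/* As noted above, a short variable name (iar) keeps the long
	 * type name manageable in practice. */
	struct blk_independent_access_ranges iars = {
		.nr_ranges = 2,
		.range = {
			{ .sector = 0,       .nr_sectors = 1000000 },
			{ .sector = 1000000, .nr_sectors = 1000000 },
		},
	};
	struct blk_independent_access_range *iar;

	for (unsigned int i = 0; i < iars.nr_ranges; i++) {
		iar = &iars.range[i];
		printf("range %u: sector=%llu nr_sectors=%llu\n", i,
		       (unsigned long long)iar->sector,
		       (unsigned long long)iar->nr_sectors);
	}
	return 0;
}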