Message ID | 10c4d38005e24bb477b5182d4fc17879139f14d9.1470753817.git.tom.ty89@gmail.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Hello. On 08/09/2016 05:45 PM, tom.ty89@gmail.com wrote: > From: Tom Yan <tom.ty89@gmail.com> > > Currently block layer limit max_hw_sectors is set to > ATA_MAX_SECTORS_LBA48 (65535), for devices with LBA48 support. > > However, block layer limit max_sectors (which is the effective > one; also adjustable, upper-bounded by max_hw_sectors) is set to > BLK_DEF_MAX_SECTORS (currently 2560) by the scsi disk driver, > since libata's SATL does not report an Optimal Transfer Length. > > This does not make much sense, especially when the current > BLK_DEF_MAX_SECTORS appears to be unsafe for some ATA devices > (see ATA_HORKAGE_MAX_SEC_1024). Truth is, the current value > appears to be arbitrary anyway. See commit d2be537c3ba3 > ("block: bump BLK_DEF_MAX_SECTORS to 2560"). > > Therefore, avoid setting dev->max_sectors when it is strictly > necessary. Leave it as 0 otherwise, so that both block layer Hmm, avoid it if it's strictly necessary? > limits will remain as SCSI_DEFAULT_MAX_SECTORS (currently 1024). You forgot to sign off on the patch. MBR, Sergei -- To unsubscribe from this list: send the line "unsubscribe linux-block" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Hello, On Tue, Aug 09, 2016 at 10:45:47PM +0800, tom.ty89@gmail.com wrote: > From: Tom Yan <tom.ty89@gmail.com> > > Currently block layer limit max_hw_sectors is set to > ATA_MAX_SECTORS_LBA48 (65535), for devices with LBA48 support. > > However, block layer limit max_sectors (which is the effective > one; also adjustable, upper-bounded by max_hw_sectors) is set to > BLK_DEF_MAX_SECTORS (currently 2560) by the scsi disk driver, > since libata's SATL does not report an Optimal Transfer Length. > > This does not make much sense, especially when the current > BLK_DEF_MAX_SECTORS appears to be unsafe for some ATA devices > (see ATA_HORKAGE_MAX_SEC_1024). Truth is, the current value > appears to be arbitrary anyway. See commit d2be537c3ba3 > ("block: bump BLK_DEF_MAX_SECTORS to 2560"). > > Therefore, avoid setting dev->max_sectors when it is strictly > necessary. Leave it as 0 otherwise, so that both block layer > limits will remain as SCSI_DEFAULT_MAX_SECTORS (currently 1024). These changes feel rather gratuitous. What's the upside here? Thanks.
I have to admit that libata may not be the right place to deal with my concern over the current BLK_DEF_MAX_SECTORS, which seems non-sensical to me. In the original commit message: (d2be537c3ba3, "block: bump BLK_DEF_MAX_SECTORS to 2560") "A value of 2560 (1280k) will accommodate a 10-data-disk stripe write with chunk size 128k...a value of 1280 does not show a big performance difference from 512, but will hopefully help software RAID setups using SATA disks, as reported by Christoph." So I have no idea at all why the bump was allowed. What so special about "10-data-disk stripe write with chunk size 128k", that we would want to make a block layer general default base on that? The macro appeared to be used by the aoeblk driver only. Yet since a later commit (ca369d51b3e1, "block/sd: Fix device-imposed transfer length limits"), the scsi disk driver will use it to set the effective max_sectors(_kb) for devices that does not report Optimal Transfer Length in the Block Limit VPD (which is the case of libata's SATL). So the consequence is, ATA drives with max_hw_sectors(_kb) (i.e. dev->max_sectors set in libata-core.c) set to higher than BLK_DEF_MAX_SECTORS will end up having it as the effective max_sectors. Not only the value is non-sensical, but also the logic. (Btw, ATA_HORKAGE_MAX_SEC_LBA48 will sort of become ineffective because of this.) Now let's just come back to libata. I've thought of reporting dev->max_sectors as Optimal Transfer Length in the SATL. However, I am not sure if it is a safe thing to do, because we set it as high as 65535 for devices with LBA48 devices. Does such a high max_sectors ever make sense in libata's case? That's why I took the approach of the USB Attached SCSI driver. That is, we only explicitly set max_hw_sectors in cases that is strictly necessary (e.g. no LBA48 support, horkages). Otherwise we'll leave request queue as-is. I've also thought of reporting dev->max_sectors as Optimal Transfer Length but cap it at 1024, so that users will still be allowed to bump max_sectors_kb up to 32767 kb, but we'll always start at 512 kb or less. But this approach have a side-effect that I find ugly, that is the block layer limit "io_opt" could be set to a value that is not max_hw_sectors(_kb). I have also no idea about the consequence of having io_opt set. On 10 August 2016 at 12:10, Tejun Heo <tj@kernel.org> wrote: > Hello, > > On Tue, Aug 09, 2016 at 10:45:47PM +0800, tom.ty89@gmail.com wrote: >> From: Tom Yan <tom.ty89@gmail.com> >> >> Currently block layer limit max_hw_sectors is set to >> ATA_MAX_SECTORS_LBA48 (65535), for devices with LBA48 support. >> >> However, block layer limit max_sectors (which is the effective >> one; also adjustable, upper-bounded by max_hw_sectors) is set to >> BLK_DEF_MAX_SECTORS (currently 2560) by the scsi disk driver, >> since libata's SATL does not report an Optimal Transfer Length. >> >> This does not make much sense, especially when the current >> BLK_DEF_MAX_SECTORS appears to be unsafe for some ATA devices >> (see ATA_HORKAGE_MAX_SEC_1024). Truth is, the current value >> appears to be arbitrary anyway. See commit d2be537c3ba3 >> ("block: bump BLK_DEF_MAX_SECTORS to 2560"). >> >> Therefore, avoid setting dev->max_sectors when it is strictly >> necessary. Leave it as 0 otherwise, so that both block layer >> limits will remain as SCSI_DEFAULT_MAX_SECTORS (currently 1024). > > These changes feel rather gratuitous. What's the upside here? > > Thanks. > > -- > tejun -- To unsubscribe from this list: send the line "unsubscribe linux-block" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Hello, Tom. On Wed, Aug 10, 2016 at 04:32:39PM +0800, Tom Yan wrote: > I have to admit that libata may not be the right place to deal with my > concern over the current BLK_DEF_MAX_SECTORS, which seems non-sensical > to me. In the original commit message: > > (d2be537c3ba3, "block: bump BLK_DEF_MAX_SECTORS to 2560") > "A value of 2560 (1280k) will accommodate a 10-data-disk stripe write > with chunk size 128k...a value of 1280 does not show a big performance > difference from 512, but will hopefully help software RAID setups > using SATA disks, as reported by Christoph." > > So I have no idea at all why the bump was allowed. What so special In general, people are okay with upping the limits as long as there are actual use cases which can benefit. The devices are getting constantly faster and the hardware limits are going up along with it after all. > about "10-data-disk stripe write with chunk size 128k", that we would > want to make a block layer general default base on that? It's not special. It's just an actual use case which can benefit from the raised limit. > The macro appeared to be used by the aoeblk driver only. Yet since a > later commit (ca369d51b3e1, "block/sd: Fix device-imposed transfer > length limits"), the scsi disk driver will use it to set the effective > max_sectors(_kb) for devices that does not report Optimal Transfer > Length in the Block Limit VPD (which is the case of libata's SATL). > > So the consequence is, ATA drives with max_hw_sectors(_kb) (i.e. > dev->max_sectors set in libata-core.c) set to higher than > BLK_DEF_MAX_SECTORS will end up having it as the effective > max_sectors. Not only the value is non-sensical, but also the logic. > (Btw, ATA_HORKAGE_MAX_SEC_LBA48 will sort of become ineffective > because of this.) > > Now let's just come back to libata. I've thought of reporting > dev->max_sectors as Optimal Transfer Length in the SATL. However, I am > not sure if it is a safe thing to do, because we set it as high as > 65535 for devices with LBA48 devices. Does such a high max_sectors > ever make sense in libata's case? libata is just exposing whatever the underlying hardware supports. Whether that should be exposed as optimal transfer length is a different matter. > That's why I took the approach of the USB Attached SCSI driver. That > is, we only explicitly set max_hw_sectors in cases that is strictly > necessary (e.g. no LBA48 support, horkages). Otherwise we'll leave > request queue as-is. I see. Alright, let's do that. Can you please respin the patch with more detailed description? Thanks.
>>>>> "Tom" == Tom Yan <tom.ty89@gmail.com> writes:
Tom,
Tom> Now let's just come back to libata. I've thought of reporting dev->
Tom> max_sectors as Optimal Transfer Length in the SATL. However, I am
Tom> not sure if it is a safe thing to do, because we set it as high as
Tom> 65535 for devices with LBA48 devices. Does such a high max_sectors
Tom> ever make sense in libata's case?
I don't agree with conflating the optimal transfer size and the maximum
supported ditto. Submitting the largest possible I/O to a device does
not guarantee that you get the best overall performance.
- max_hw_sectors is gated by controller DMA constraints.
- max_dev_sectors is set for devices that explicitly report a transfer
length limit.
- max_sectors, the soft limit for filesystem read/write requests,
should be left at BLK_DEF_MAX_SECTORS unless the device explicitly
requests transfers to be aligned multiples of a different value
(typically the internal stripe size in large arrays).
The point of BLK_DEF_MAX_SECTORS is to offer a reasonable default for
common workloads unless otherwise instructed by the storage device.
We can have a discussion about what the right value for
BLK_DEF_MAX_SECTORS should be. It has gone up over time but it used to
be the case that permitting large transfers significantly impacted
interactive I/O performance. And finding a sweet spot that works for a
wide variety of hardware, interconnects and workloads is obviously
non-trivial.
On 11 August 2016 at 11:37, Martin K. Petersen <martin.petersen@oracle.com> wrote: >>>>>> "Tom" == Tom Yan <tom.ty89@gmail.com> writes: > > I don't agree with conflating the optimal transfer size and the maximum > supported ditto. Submitting the largest possible I/O to a device does > not guarantee that you get the best overall performance. > > - max_hw_sectors is gated by controller DMA constraints. > > - max_dev_sectors is set for devices that explicitly report a transfer > length limit. > > - max_sectors, the soft limit for filesystem read/write requests, > should be left at BLK_DEF_MAX_SECTORS unless the device explicitly > requests transfers to be aligned multiples of a different value > (typically the internal stripe size in large arrays). Shouldn't we use Maximum Transfer Length to derive max_sectors (and get rid of the almost useless max_dev_sectors)? Honestly it looks pretty non-sensical to me that the SCSI disk driver uses Optimal Transfer Length for max_sectors. Also in libata's case, this make setting the effective max_sectors (e.g. see ATA_HORKAGE_MAX_SEC_LBA48) impossible if we do not want to touch io_opt. It would look to me that our block layer simply have a flawed design if we really need to derive both io_opt and max_sectors from the same field. > > The point of BLK_DEF_MAX_SECTORS is to offer a reasonable default for > common workloads unless otherwise instructed by the storage device. > > We can have a discussion about what the right value for > BLK_DEF_MAX_SECTORS should be. It has gone up over time but it used to > be the case that permitting large transfers significantly impacted > interactive I/O performance. And finding a sweet spot that works for a > wide variety of hardware, interconnects and workloads is obviously > non-trivial. > If BLK_DEF_MAX_SECTORS is supposed to be used as a fallback, then it should be a safe value, especially when max_sectors_kb can be adjusted through sysfs. But the biggest problem isn't on bumping it, but the value picked is totally irrational for a general default. I mean, given that it was 1024 (512k), try to double it? Fine. Try to quadruple it? Alright. We'll need to deal with some alignment / boundary issue (like the typical 65535 vs 65536 case)? Okay let's do it. But what's the sense in picking a random RAID configuartion as the base to decide the default? Also, if max_sectors need to concern about the number of disks used and chunk sizes in a RAID configuartion, it should be calculated in the device-mapper layer or a specific driver or so. Changing a block layer default won't help anyway. Say 2560 will accomodate a 10-disk 128k-chunk RAID. What about a 12-disk 128k-chunk RAID then? Why not just decide the value base on an 8-disk 128k-chunk RAID, which HAPPENED to be a double of 1024 as well? It does not make sense that the SCSI disk driver uses it as the fallback either. SCSI host templates that does not have max_sectors set (as well as some specific driver) will use SCSI_DEFAULT_MAX_SECTORS as the fallback, for such hosts max_hw_sectors will be 1024, where the current BLK_DEF_MAX_SECTORS cannot apply as max_sectors anyway. So we should use also SCSI_DEFAULT_MAX_SECTORS in the SCSI disk driver as fallback for max_sectors. If the value is considered to low even as a safe fallback, then it should be bumped appropriately. (Or we might want to replace it with BLK_DEF_MAX_SECTORS everywhere in the SCSI layer, that said, after the value is fixed.) > -- > Martin K. Petersen Oracle Linux Engineering -- To unsubscribe from this list: send the line "unsubscribe linux-block" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
>>>>> "Tom" == Tom Yan <tom.ty89@gmail.com> writes:
Hey Tom,
Tom> Shouldn't we use Maximum Transfer Length to derive max_sectors (and
Tom> get rid of the almost useless max_dev_sectors)?
MAXIMUM TRANSFER LENGTH could be gigabytes. Some disks report it as the
full capacity of the device.
Again, the point of max_hw_sectors and max_dev_sectors is to enforce the
hard limits of controller and device respectively. Nothing else.
max_sectors is a soft limit for filesystem read/write requests and is
chosen to be a sane default for common workloads. It can be bumped (up
to the smaller of the hard limits) if the user so desires.
Tom> Honestly it looks pretty non-sensical to me that the SCSI disk
Tom> driver uses Optimal Transfer Length for max_sectors.
OPTIMAL TRANSFER LENGTH is there to address a very specific problem,
namely avoiding partial stripe writes on RAID arrays.
For almost all other device classes it is not reported and it makes no
sense to do so. For those remaining devices BLK_DEF_MAX_SECTORS comes
into play.
Tom> But the biggest problem isn't on bumping it, but the value picked
Tom> is totally irrational for a general default. I mean, given that it
Tom> was 1024 (512k), try to double it? Fine. Try to quadruple it?
Tom> Alright. We'll need to deal with some alignment / boundary issue
Tom> (like the typical 65535 vs 65536 case)? Okay let's do it. But
Tom> what's the sense in picking a random RAID configuartion as the base
Tom> to decide the default?
I agree that the new default is completely arbitrary. And I think there
was a bit of controversy at the time. However, the numbers were
compelling so the change went in.
We are open to a discussion about what the default should be. But it
involves backing the discussion up with solid data.
Tom> So we should use also SCSI_DEFAULT_MAX_SECTORS in the SCSI disk
Tom> driver as fallback for max_sectors. If the value is considered to
Tom> low even as a safe fallback, then it should be bumped
Tom> appropriately. (Or we might want to replace it with
Tom> BLK_DEF_MAX_SECTORS everywhere in the SCSI layer, that said, after
Tom> the value is fixed.)
SCSI_DEFAULT_MAX_SECTORS is there to provide a safe max for legacy SCSI
controller drivers that haven't been updated or tested with larger
transfers. It has nothing to do with sd drive limits or block layer
defaults.
On 12 August 2016 at 10:01, Martin K. Petersen <martin.petersen@oracle.com> wrote: > > Again, the point of max_hw_sectors and max_dev_sectors is to enforce the > hard limits of controller and device respectively. Nothing else. > Sounds like libata-scsi is doing something wrong then. It should not set max_hw_sectors to dev->max_sectors that is set by libata-core: > blk_queue_max_hw_sectors(q, dev->max_sectors); but instead it should report it as Maximum Transfer Length and let sd set it as max_dev_sectors: > put_unaligned_be32(dev->max_sectors, &rbuf[8]); While max_hw_sectors will be left untouched (in the case of AHCI, for example, since its SCSI host template does not have max_sectors set; so max_hw_sectors will be SCSI_DEFAULT_MAX_SECTORS). Although that means setting dev->max_sectors to a value larger than 1024 will probably be a no-op, if that's really an issue, we should have the host templates in libata updated. Make sense? -- To unsubscribe from this list: send the line "unsubscribe linux-block" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 12 August 2016 at 13:18, Tom Yan <tom.ty89@gmail.com> wrote: > On 12 August 2016 at 10:01, Martin K. Petersen > <martin.petersen@oracle.com> wrote: >> >> Again, the point of max_hw_sectors and max_dev_sectors is to enforce the >> hard limits of controller and device respectively. Nothing else. >> > > Sounds like libata-scsi is doing something wrong then. It should not > set max_hw_sectors to dev->max_sectors that is set by libata-core: > >> blk_queue_max_hw_sectors(q, dev->max_sectors); Though we'll still have to "abuse" max_hw_sectors for ATAPI class devices, since neither the SATL or sr cares about VPD. The only devices that need ATA_HORKAGE_MAX_SEC_LBA48 are ATAPI devices as well. > > but instead it should report it as Maximum Transfer Length and let sd > set it as max_dev_sectors: > >> put_unaligned_be32(dev->max_sectors, &rbuf[8]); > > While max_hw_sectors will be left untouched (in the case of AHCI, for > example, since its SCSI host template does not have max_sectors set; > so max_hw_sectors will be SCSI_DEFAULT_MAX_SECTORS). > > Although that means setting dev->max_sectors to a value larger than > 1024 will probably be a no-op, if that's really an issue, we should > have the host templates in libata updated. > > Make sense? -- To unsubscribe from this list: send the line "unsubscribe linux-block" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
> Tom> is totally irrational for a general default. I mean, given that it > Tom> was 1024 (512k), try to double it? Fine. Try to quadruple it? > Tom> Alright. We'll need to deal with some alignment / boundary issue > Tom> (like the typical 65535 vs 65536 case)? Okay let's do it. But > Tom> what's the sense in picking a random RAID configuartion as the base > Tom> to decide the default? > > I agree that the new default is completely arbitrary. And I think there > was a bit of controversy at the time. However, the numbers were > compelling so the change went in. For older SCSI and especially ATA drives (and it wouldn't surprise me if it is true of modern ones) there are also huge latency tradeoffs. Before you jump up and down about numbers what are the latency numbers like on classic ATA drives with that sized block I/O. You could easily up your RAID numbers while wrecking realtime and desktop performance. Without NCQ you certainly don't want huge I/O's blocking up the controller, with NCQ it depends if the firmware these days has improved and understands any kind of fairness or like the early ones and older SCSI drives will happily starve requests. Conceptually it also seems wrong - if you want to avoid partial stripe writes on a consumer grade disk subsystem then use smaller stripes. At the ATA level we can detect both the presence of command queueing ability, and also whether the device is spinnning rust or not, so it may be smarter defaults could be done based upon whether the device is an SSD or not. Alan -- To unsubscribe from this list: send the line "unsubscribe linux-block" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
>>>>> "Tom" == Tom Yan <tom.ty89@gmail.com> writes:
Tom,
Tom> Sounds like libata-scsi is doing something wrong then. It should
Tom> not set max_hw_sectors to dev->max_sectors that is set by
Tom> libata-core:
Tom> but instead it should report it as Maximum Transfer Length and let
Tom> sd set it as max_dev_sectors:
Well, it's mostly academic since there is a 1:1 mapping. It is important
in SCSI where you have many devices with different properties hooked up
to the same controller.
SCSI MTL is a recent addition. There wasn't much point in making a
distinction between hw and dev outside of sd.c since it is the only
device type that has a way to communicate a transfer limit.
Tom> Although that means setting dev->max_sectors to a value larger than
Tom> 1024 will probably be a no-op, if that's really an issue, we should
Tom> have the host templates in libata updated.
I am still not sure exactly which problem it is you are trying to fix?
>>>>> "Alan" == One Thousand Gnomes <gnomes@lxorguk.ukuu.org.uk> writes:
Alan,
Alan> For older SCSI and especially ATA drives (and it wouldn't surprise
Alan> me if it is true of modern ones) there are also huge latency
Alan> tradeoffs.
Absolutely.
Alan> Before you jump up and down about numbers what are the latency
Alan> numbers like on classic ATA drives with that sized block I/O. You
Alan> could easily up your RAID numbers while wrecking realtime and
Alan> desktop performance.
I was in no way advocating raising the default, quite the contrary. We
have several customer workloads that got negatively impacted when the
limit was bumped.
My point was that arguing about what the default should be without any
data supporting the discussion is futile.
Alan> At the ATA level we can detect both the presence of command
Alan> queueing ability, and also whether the device is spinnning rust or
Alan> not, so it may be smarter defaults could be done based upon
Alan> whether the device is an SSD or not.
Sure.
However, modern SSDs actually trend towards smaller I/O sizes. NVMe is
128K by default, I think. So a higher default wouldn't make any
difference there.
diff --git a/drivers/ata/libata-core.c b/drivers/ata/libata-core.c index 0749f71..07259d8 100644 --- a/drivers/ata/libata-core.c +++ b/drivers/ata/libata-core.c @@ -2610,10 +2610,8 @@ int ata_dev_configure(struct ata_device *dev) dma_dir_string); } - /* determine max_sectors */ - dev->max_sectors = ATA_MAX_SECTORS; - if (dev->flags & ATA_DFLAG_LBA48) - dev->max_sectors = ATA_MAX_SECTORS_LBA48; + if (!(dev->flags & ATA_DFLAG_LBA48)) + dev->max_sectors = ATA_MAX_SECTORS; /* Limit PATA drive on SATA cable bridge transfers to udma5, 200 sectors */
From: Tom Yan <tom.ty89@gmail.com> Currently block layer limit max_hw_sectors is set to ATA_MAX_SECTORS_LBA48 (65535), for devices with LBA48 support. However, block layer limit max_sectors (which is the effective one; also adjustable, upper-bounded by max_hw_sectors) is set to BLK_DEF_MAX_SECTORS (currently 2560) by the scsi disk driver, since libata's SATL does not report an Optimal Transfer Length. This does not make much sense, especially when the current BLK_DEF_MAX_SECTORS appears to be unsafe for some ATA devices (see ATA_HORKAGE_MAX_SEC_1024). Truth is, the current value appears to be arbitrary anyway. See commit d2be537c3ba3 ("block: bump BLK_DEF_MAX_SECTORS to 2560"). Therefore, avoid setting dev->max_sectors when it is strictly necessary. Leave it as 0 otherwise, so that both block layer limits will remain as SCSI_DEFAULT_MAX_SECTORS (currently 1024).