Message ID | 833713246c1a70d85150289c1537797d6f7dd606.1470903899.git.tom.ty89@gmail.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
On Thu, Aug 11, 2016 at 3:26 AM, <tom.ty89@gmail.com> wrote: > From: Tom Yan <tom.ty89@gmail.com> > > Currently we advertise Maximum Write Same Length based on the > maximum number of sectors that one-block TRIM payload can cover. > The field are used to derived discard_max_bytes and > write_same_max_bytes limits in the block layer, which currently can > at max be 0xffffffff (32-bit). > > However, with a AF 4Kn drive, the derived limits would be 65535 * > 64 * 4096 = 0x3fffc0000 (34-bit). Therefore, we now devide > ATA_MAX_TRIM_RNUM with (logical sector size / 512), so that the > derived limits will not overflow. > > The limits are now also consistent among drives with different > logical sector sizes. (Although that may or may not be what we > want ultimately when the SCSI / block layer allows larger > representation in the future.) > > Although 4Kn ATA SSDs may not be a thing on the market yet, this > patch is necessary for forthcoming SCT Write Same translation > support, which could be available on traditional HDDs where 4Kn is > already a thing. Also it should not change the current behavior on > drives with 512-byte logical sectors. > > Note: this patch is not about AF 512e drives. > Signed-off-by: Tom Yan <tom.ty89@gmail.com> > > diff --git a/drivers/ata/libata-scsi.c b/drivers/ata/libata-scsi.c > index be9c76c..dcadcaf 100644 > --- a/drivers/ata/libata-scsi.c > +++ b/drivers/ata/libata-scsi.c > @@ -2295,6 +2295,7 @@ static unsigned int ata_scsiop_inq_89(struct ata_scsi_args *args, u8 *rbuf) > static unsigned int ata_scsiop_inq_b0(struct ata_scsi_args *args, u8 *rbuf) > { > u16 min_io_sectors; > + u32 sector_size; > > rbuf[1] = 0xb0; > rbuf[3] = 0x3c; /* required VPD size with unmap support */ > @@ -2309,17 +2310,27 @@ static unsigned int ata_scsiop_inq_b0(struct ata_scsi_args *args, u8 *rbuf) > min_io_sectors = 1 << ata_id_log2_per_physical_sector(args->id); > put_unaligned_be16(min_io_sectors, &rbuf[6]); > > - /* > - * Optimal unmap granularity. > - * > - * The ATA spec doesn't even know about a granularity or alignment > - * for the TRIM command. We can leave away most of the unmap related > - * VPD page entries, but we have specifify a granularity to signal > - * that we support some form of unmap - in thise case via WRITE SAME > - * with the unmap bit set. > - */ > + sector_size = ata_id_logical_sector_size(args->id); > if (ata_id_has_trim(args->id)) { > - put_unaligned_be64(65535 * ATA_MAX_TRIM_RNUM, &rbuf[36]); > + /* > + * Maximum write same length. > + * > + * Avoid overflow in discard_max_bytes and write_same_max_bytes > + * with AF 4Kn drives. Also make them consistent among drives > + * with different logical sector sizes. > + */ > + put_unaligned_be64(65535 * ATA_MAX_TRIM_RNUM / > + (sector_size / 512), &rbuf[36]); I think the existing fixups in sd_setup_discard_cmnd() and sd_setup_write_same_cmnd() are 'doing the right thing'. If I understand the stack correctly: libata-scsi.c (and sd.c) both report a maximum in terms of 512 byte sectors. The upper layer stack works (mostly) on a mix of bytes and 512 byte sectors agnostic of the underlying hardware ... mostly. There are some bits in the files systems and block layer that are honoring the logical block size being larger 512 bytes as all I/O being generated are multiples of the logical block size as per block device's request_queue / queue_limits. So regardless of a 4Kn device being able to handle an 8x larger I/O as per the logical sector being bigger that's basically ignored, for convenience. In the scsi upper layer as the command are being setup the shift from 512 to 'sector_size' is handled to the number of device sectors is matched up to the request: sector >>= ilog2(sdp->sector_size) - 9; nr_sectors >>= ilog2(sdp->sector_size) - 9; So if you correctly report number of logical sectors here you break the 'fix' in sd.c At least that is my understanding. > + > + /* > + * Optimal unmap granularity. > + * > + * The ATA spec doesn't even know about a granularity or alignment > + * for the TRIM command. We can leave away most of the unmap related > + * VPD page entries, but we have specifify a granularity to signal > + * that we support some form of unmap - in thise case via WRITE SAME > + * with the unmap bit set. > + */ > put_unaligned_be32(1, &rbuf[28]); > } > > -- > 2.9.2 > Regards, Shaun -- To unsubscribe from this list: send the line "unsubscribe linux-block" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
The patch isn't about how the request from the block layer will be processed (to form the SCSI commands). What it addresses is blk_queue_max_write_same_sectors() and blk_queue_max_discard_sectors() that are called in the SCSI disk driver. You can see that they are called with an input of the Maximum Write Same Length times (logical_block_size >> 9), which is to convert the number of sectors to an appropriate value for the block layer (512-byte block based). On 4Kn drives, the multiplier will be 4096 >> 9, which is 8. So if the reported Maximum Write Same Length is 4194240, when the value is passed onto the block layer to set the limits for a 4Kn drive, it will be 4194240 * 8. (And when this value is represented in bytes, it will be further multiplied by 512 and over 32-bit.) `logical_block_size >> 9` is pretty much the same thing as ``logical_block_size / 512`. I should have probably used the bit shift way instead. On 11 August 2016 at 17:04, Shaun Tancheff <shaun@tancheff.com> wrote: > On Thu, Aug 11, 2016 at 3:26 AM, <tom.ty89@gmail.com> wrote: >> From: Tom Yan <tom.ty89@gmail.com> >> >> Currently we advertise Maximum Write Same Length based on the >> maximum number of sectors that one-block TRIM payload can cover. >> The field are used to derived discard_max_bytes and >> write_same_max_bytes limits in the block layer, which currently can >> at max be 0xffffffff (32-bit). >> >> However, with a AF 4Kn drive, the derived limits would be 65535 * >> 64 * 4096 = 0x3fffc0000 (34-bit). Therefore, we now devide >> ATA_MAX_TRIM_RNUM with (logical sector size / 512), so that the >> derived limits will not overflow. >> >> The limits are now also consistent among drives with different >> logical sector sizes. (Although that may or may not be what we >> want ultimately when the SCSI / block layer allows larger >> representation in the future.) >> >> Although 4Kn ATA SSDs may not be a thing on the market yet, this >> patch is necessary for forthcoming SCT Write Same translation >> support, which could be available on traditional HDDs where 4Kn is >> already a thing. Also it should not change the current behavior on >> drives with 512-byte logical sectors. >> >> Note: this patch is not about AF 512e drives. >> Signed-off-by: Tom Yan <tom.ty89@gmail.com> >> >> diff --git a/drivers/ata/libata-scsi.c b/drivers/ata/libata-scsi.c >> index be9c76c..dcadcaf 100644 >> --- a/drivers/ata/libata-scsi.c >> +++ b/drivers/ata/libata-scsi.c >> @@ -2295,6 +2295,7 @@ static unsigned int ata_scsiop_inq_89(struct ata_scsi_args *args, u8 *rbuf) >> static unsigned int ata_scsiop_inq_b0(struct ata_scsi_args *args, u8 *rbuf) >> { >> u16 min_io_sectors; >> + u32 sector_size; >> >> rbuf[1] = 0xb0; >> rbuf[3] = 0x3c; /* required VPD size with unmap support */ >> @@ -2309,17 +2310,27 @@ static unsigned int ata_scsiop_inq_b0(struct ata_scsi_args *args, u8 *rbuf) >> min_io_sectors = 1 << ata_id_log2_per_physical_sector(args->id); >> put_unaligned_be16(min_io_sectors, &rbuf[6]); >> >> - /* >> - * Optimal unmap granularity. >> - * >> - * The ATA spec doesn't even know about a granularity or alignment >> - * for the TRIM command. We can leave away most of the unmap related >> - * VPD page entries, but we have specifify a granularity to signal >> - * that we support some form of unmap - in thise case via WRITE SAME >> - * with the unmap bit set. >> - */ >> + sector_size = ata_id_logical_sector_size(args->id); >> if (ata_id_has_trim(args->id)) { >> - put_unaligned_be64(65535 * ATA_MAX_TRIM_RNUM, &rbuf[36]); >> + /* >> + * Maximum write same length. >> + * >> + * Avoid overflow in discard_max_bytes and write_same_max_bytes >> + * with AF 4Kn drives. Also make them consistent among drives >> + * with different logical sector sizes. >> + */ >> + put_unaligned_be64(65535 * ATA_MAX_TRIM_RNUM / >> + (sector_size / 512), &rbuf[36]); > > I think the existing fixups in sd_setup_discard_cmnd() and > sd_setup_write_same_cmnd() > are 'doing the right thing'. > > If I understand the stack correctly: > > libata-scsi.c (and sd.c) both report a maximum in terms of 512 byte sectors. > The upper layer stack works (mostly) on a mix of bytes and 512 byte sectors > agnostic of the underlying hardware ... mostly. There are some bits in the > files systems and block layer that are honoring the logical block size being > larger 512 bytes as all I/O being generated are multiples of the logical block > size as per block device's request_queue / queue_limits. > > So regardless of a 4Kn device being able to handle an 8x larger I/O as per > the logical sector being bigger that's basically ignored, for convenience. > > In the scsi upper layer as the command are being setup the shift from > 512 to 'sector_size' is handled to the number of device sectors is > matched up to the request: > > sector >>= ilog2(sdp->sector_size) - 9; > nr_sectors >>= ilog2(sdp->sector_size) - 9; > > So if you correctly report number of logical sectors here you break > the 'fix' in sd.c > > At least that is my understanding. > >> + >> + /* >> + * Optimal unmap granularity. >> + * >> + * The ATA spec doesn't even know about a granularity or alignment >> + * for the TRIM command. We can leave away most of the unmap related >> + * VPD page entries, but we have specifify a granularity to signal >> + * that we support some form of unmap - in thise case via WRITE SAME >> + * with the unmap bit set. >> + */ >> put_unaligned_be32(1, &rbuf[28]); >> } >> >> -- >> 2.9.2 >> > > Regards, > Shaun -- To unsubscribe from this list: send the line "unsubscribe linux-block" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
>>>>> "Tom," == tom ty89 <tom.ty89@gmail.com> writes:
Tom,
+ put_unaligned_be64(65535 * ATA_MAX_TRIM_RNUM /
+ (sector_size / 512), &rbuf[36]);
MAXIMUM WRITE SAME LENGTH is in units of the device's logical block
size.
On 12 August 2016 at 09:34, Martin K. Petersen <martin.petersen@oracle.com> wrote: >>>>>> "Tom," == tom ty89 <tom.ty89@gmail.com> writes: > > Tom, > > + put_unaligned_be64(65535 * ATA_MAX_TRIM_RNUM / > + (sector_size / 512), &rbuf[36]); > > MAXIMUM WRITE SAME LENGTH is in units of the device's logical block > size. Yeah that's why: + sector_size = ata_id_logical_sector_size(args->id); Where the logical sector size of the device is reported with the same function in ata_scsi_dev_config() and ata_scsiop_read_cap(). So were you trying to pointing out something I am still missing, or were you merely confirming I was right? > > -- > Martin K. Petersen Oracle Linux Engineering -- To unsubscribe from this list: send the line "unsubscribe linux-block" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
>>>>> "Tom" == Tom Yan <tom.ty89@gmail.com> writes: Tom, >> put_unaligned_be64(65535 * ATA_MAX_TRIM_RNUM / (sector_size / 512), &rbuf[36]); How many 8-byte ranges fit in a 4096-byte sector? Tom> So were you trying to pointing out something I am still missing, or Tom> were you merely confirming I was right? I suggest you drop ATA_MAX_TRIM_RNUM and do: enum { ATA_TRIM_BLOCKS_PER_RANGE = 65535, /* 0xffff blocks per range desc. */ ATA_TRIM_RANGE_SIZE_SHIFT = 3, /* range descriptor is 8 bytes */ }; put_unaligned_be64(ATA_TRIM_BLOCKS_PER_RANGE * sector_size >> ATA_TRIM_RANGE_SIZE_SHIFT, &rbuf[36]); Might be worthwhile to create an ata_max_lba_range_blocks() wrapper.
On Fri, Aug 12, 2016 at 3:56 PM, Martin K. Petersen <martin.petersen@oracle.com> wrote: >>>>>> "Tom" == Tom Yan <tom.ty89@gmail.com> writes: > > Tom, > >>> put_unaligned_be64(65535 * ATA_MAX_TRIM_RNUM / (sector_size / 512), &rbuf[36]); > > How many 8-byte ranges fit in a 4096-byte sector? > > Tom> So were you trying to pointing out something I am still missing, or > Tom> were you merely confirming I was right? > > I suggest you drop ATA_MAX_TRIM_RNUM and do: > > enum { > ATA_TRIM_BLOCKS_PER_RANGE = 65535, /* 0xffff blocks per range desc. */ > ATA_TRIM_RANGE_SIZE_SHIFT = 3, /* range descriptor is 8 bytes */ > }; > > put_unaligned_be64(ATA_TRIM_BLOCKS_PER_RANGE * > sector_size >> ATA_TRIM_RANGE_SIZE_SHIFT, &rbuf[36]); > > Might be worthwhile to create an ata_max_lba_range_blocks() wrapper. Ah, I think I am understanding now. When the sector size is 4K the minimum page sent with WRITE SAME will be 4K. If so, we also need to fix the write_same SATL code that is working under the assumption of a 512 byte sector sector as the largest guaranteed amount of data in the associated sg pages. Keying off of sector_size should be straight forward there...
>>>>> "Shaun" == Shaun Tancheff <shaun.tancheff@seagate.com> writes:
Shaun,
Shaun> Ah, I think I am understanding now. When the sector size is 4K
Shaun> the minimum page sent with WRITE SAME will be 4K.
Correct. We'll always transfer one logical block of data as required by
the spec.
On 13 August 2016 at 04:56, Martin K. Petersen <martin.petersen@oracle.com> wrote: >>>>>> "Tom" == Tom Yan <tom.ty89@gmail.com> writes: > > Tom, > >>> put_unaligned_be64(65535 * ATA_MAX_TRIM_RNUM / (sector_size / 512), &rbuf[36]); > > How many 8-byte ranges fit in a 4096-byte sector? The thing is, as of ACS-4, blocks that carry DSM/TRIM LBA Range Entries are always 512-byte. Honestly, I have no idea how that would work on a 4Kn SSD, if it is / will ever be a thing. Anyway I was NOT trying to bump the number entries allowed, but instead to decrease it accordingly for the increased sector size, only to make sure that the Maximum Write Same Length advertised here is gonna overflow the 32-bit limit when it is converted to a representation in bytes (i.e. discard_max_bytes and write_same_max_bytes), and probably also, the bio size. It's also a suggestion from me for Shaun to determine how the Maximum Write Same Length to be advertised, so that the size in bytes covered by a single SCT Write Same will stay at ~2G in spite of the logical sector size. > > Tom> So were you trying to pointing out something I am still missing, or > Tom> were you merely confirming I was right? > > I suggest you drop ATA_MAX_TRIM_RNUM and do: > > enum { > ATA_TRIM_BLOCKS_PER_RANGE = 65535, /* 0xffff blocks per range desc. */ > ATA_TRIM_RANGE_SIZE_SHIFT = 3, /* range descriptor is 8 bytes */ > }; > > put_unaligned_be64(ATA_TRIM_BLOCKS_PER_RANGE * > sector_size >> ATA_TRIM_RANGE_SIZE_SHIFT, &rbuf[36]); > > Might be worthwhile to create an ata_max_lba_range_blocks() wrapper. > > -- > Martin K. Petersen Oracle Linux Engineering -- To unsubscribe from this list: send the line "unsubscribe linux-block" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
>>>>> "Tom" == Tom Yan <tom.ty89@gmail.com> writes:
Tom,
Tom> The thing is, as of ACS-4, blocks that carry DSM/TRIM LBA Range
Tom> Entries are always 512-byte.
Lovely. And SAT conveniently ignores this entirely.
Tom> Honestly, I have no idea how that would work on a 4Kn SSD, if it is
Tom> / will ever be a thing.
Highly unlikely.
But I guess you would have to report 8 512-byte ranges in the 4Kn
case. And then rely on the patch we talked about to clamp the number of
sectors based on the block layer limit.
diff --git a/drivers/ata/libata-scsi.c b/drivers/ata/libata-scsi.c index be9c76c..dcadcaf 100644 --- a/drivers/ata/libata-scsi.c +++ b/drivers/ata/libata-scsi.c @@ -2295,6 +2295,7 @@ static unsigned int ata_scsiop_inq_89(struct ata_scsi_args *args, u8 *rbuf) static unsigned int ata_scsiop_inq_b0(struct ata_scsi_args *args, u8 *rbuf) { u16 min_io_sectors; + u32 sector_size; rbuf[1] = 0xb0; rbuf[3] = 0x3c; /* required VPD size with unmap support */ @@ -2309,17 +2310,27 @@ static unsigned int ata_scsiop_inq_b0(struct ata_scsi_args *args, u8 *rbuf) min_io_sectors = 1 << ata_id_log2_per_physical_sector(args->id); put_unaligned_be16(min_io_sectors, &rbuf[6]); - /* - * Optimal unmap granularity. - * - * The ATA spec doesn't even know about a granularity or alignment - * for the TRIM command. We can leave away most of the unmap related - * VPD page entries, but we have specifify a granularity to signal - * that we support some form of unmap - in thise case via WRITE SAME - * with the unmap bit set. - */ + sector_size = ata_id_logical_sector_size(args->id); if (ata_id_has_trim(args->id)) { - put_unaligned_be64(65535 * ATA_MAX_TRIM_RNUM, &rbuf[36]); + /* + * Maximum write same length. + * + * Avoid overflow in discard_max_bytes and write_same_max_bytes + * with AF 4Kn drives. Also make them consistent among drives + * with different logical sector sizes. + */ + put_unaligned_be64(65535 * ATA_MAX_TRIM_RNUM / + (sector_size / 512), &rbuf[36]); + + /* + * Optimal unmap granularity. + * + * The ATA spec doesn't even know about a granularity or alignment + * for the TRIM command. We can leave away most of the unmap related + * VPD page entries, but we have specifify a granularity to signal + * that we support some form of unmap - in thise case via WRITE SAME + * with the unmap bit set. + */ put_unaligned_be32(1, &rbuf[28]); }