diff mbox

[RFC] libata-scsi: make sure Maximum Write Same Length is not too large

Message ID 833713246c1a70d85150289c1537797d6f7dd606.1470903899.git.tom.ty89@gmail.com (mailing list archive)
State Changes Requested, archived
Headers show

Commit Message

Tom Yan Aug. 11, 2016, 8:26 a.m. UTC
From: Tom Yan <tom.ty89@gmail.com>

Currently we advertise Maximum Write Same Length based on the
maximum number of sectors that one-block TRIM payload can cover.
The field are used to derived discard_max_bytes and
write_same_max_bytes limits in the block layer, which currently can
at max be 0xffffffff (32-bit).

However, with a AF 4Kn drive, the derived limits would be 65535 *
64 * 4096 = 0x3fffc0000 (34-bit). Therefore, we now devide
ATA_MAX_TRIM_RNUM with (logical sector size / 512), so that the
derived limits will not overflow.

The limits are now also consistent among drives with different
logical sector sizes. (Although that may or may not be what we
want ultimately when the SCSI / block layer allows larger
representation in the future.)

Although 4Kn ATA SSDs may not be a thing on the market yet, this
patch is necessary for forthcoming SCT Write Same translation
support, which could be available on traditional HDDs where 4Kn is
already a thing. Also it should not change the current behavior on
drives with 512-byte logical sectors.

Note: this patch is not about AF 512e drives.
Signed-off-by: Tom Yan <tom.ty89@gmail.com>

Comments

Shaun Tancheff Aug. 11, 2016, 5:04 p.m. UTC | #1
On Thu, Aug 11, 2016 at 3:26 AM,  <tom.ty89@gmail.com> wrote:
> From: Tom Yan <tom.ty89@gmail.com>
>
> Currently we advertise Maximum Write Same Length based on the
> maximum number of sectors that one-block TRIM payload can cover.
> The field are used to derived discard_max_bytes and
> write_same_max_bytes limits in the block layer, which currently can
> at max be 0xffffffff (32-bit).
>
> However, with a AF 4Kn drive, the derived limits would be 65535 *
> 64 * 4096 = 0x3fffc0000 (34-bit). Therefore, we now devide
> ATA_MAX_TRIM_RNUM with (logical sector size / 512), so that the
> derived limits will not overflow.
>
> The limits are now also consistent among drives with different
> logical sector sizes. (Although that may or may not be what we
> want ultimately when the SCSI / block layer allows larger
> representation in the future.)
>
> Although 4Kn ATA SSDs may not be a thing on the market yet, this
> patch is necessary for forthcoming SCT Write Same translation
> support, which could be available on traditional HDDs where 4Kn is
> already a thing. Also it should not change the current behavior on
> drives with 512-byte logical sectors.
>
> Note: this patch is not about AF 512e drives.
> Signed-off-by: Tom Yan <tom.ty89@gmail.com>
>
> diff --git a/drivers/ata/libata-scsi.c b/drivers/ata/libata-scsi.c
> index be9c76c..dcadcaf 100644
> --- a/drivers/ata/libata-scsi.c
> +++ b/drivers/ata/libata-scsi.c
> @@ -2295,6 +2295,7 @@ static unsigned int ata_scsiop_inq_89(struct ata_scsi_args *args, u8 *rbuf)
>  static unsigned int ata_scsiop_inq_b0(struct ata_scsi_args *args, u8 *rbuf)
>  {
>         u16 min_io_sectors;
> +       u32 sector_size;
>
>         rbuf[1] = 0xb0;
>         rbuf[3] = 0x3c;         /* required VPD size with unmap support */
> @@ -2309,17 +2310,27 @@ static unsigned int ata_scsiop_inq_b0(struct ata_scsi_args *args, u8 *rbuf)
>         min_io_sectors = 1 << ata_id_log2_per_physical_sector(args->id);
>         put_unaligned_be16(min_io_sectors, &rbuf[6]);
>
> -       /*
> -        * Optimal unmap granularity.
> -        *
> -        * The ATA spec doesn't even know about a granularity or alignment
> -        * for the TRIM command.  We can leave away most of the unmap related
> -        * VPD page entries, but we have specifify a granularity to signal
> -        * that we support some form of unmap - in thise case via WRITE SAME
> -        * with the unmap bit set.
> -        */
> +       sector_size = ata_id_logical_sector_size(args->id);
>         if (ata_id_has_trim(args->id)) {
> -               put_unaligned_be64(65535 * ATA_MAX_TRIM_RNUM, &rbuf[36]);
> +               /*
> +                * Maximum write same length.
> +                *
> +                * Avoid overflow in discard_max_bytes and write_same_max_bytes
> +                * with AF 4Kn drives. Also make them consistent among drives
> +                * with different logical sector sizes.
> +                */
> +               put_unaligned_be64(65535 * ATA_MAX_TRIM_RNUM /
> +                                  (sector_size / 512), &rbuf[36]);

I think the existing fixups in sd_setup_discard_cmnd() and
sd_setup_write_same_cmnd()
are 'doing the right thing'.

If I understand the stack correctly:

libata-scsi.c (and sd.c) both report a maximum in terms of 512 byte sectors.
The upper layer stack works (mostly) on a mix of bytes and 512 byte sectors
agnostic of the underlying hardware ... mostly. There are some bits in the
files systems and block layer that are honoring the logical block size being
larger 512 bytes as all I/O being generated are multiples of the logical block
size as per block device's request_queue / queue_limits.

So regardless of a 4Kn device being able to handle an 8x larger I/O as per
the logical sector being bigger that's basically ignored, for convenience.

In the scsi upper layer as the command are being setup the shift from
512 to 'sector_size' is handled to the number of device sectors is
matched up to the request:

    sector >>= ilog2(sdp->sector_size) - 9;
    nr_sectors >>= ilog2(sdp->sector_size) - 9;

So if you correctly report number of logical sectors here you break
the 'fix' in sd.c

At least that is my understanding.

> +
> +               /*
> +                * Optimal unmap granularity.
> +                *
> +                * The ATA spec doesn't even know about a granularity or alignment
> +                * for the TRIM command.  We can leave away most of the unmap related
> +                * VPD page entries, but we have specifify a granularity to signal
> +                * that we support some form of unmap - in thise case via WRITE SAME
> +                * with the unmap bit set.
> +                */
>                 put_unaligned_be32(1, &rbuf[28]);
>         }
>
> --
> 2.9.2
>

Regards,
Shaun
--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Tom Yan Aug. 11, 2016, 9:17 p.m. UTC | #2
The patch isn't about how the request from the block layer will be
processed (to form the SCSI commands).

What it addresses is blk_queue_max_write_same_sectors() and
blk_queue_max_discard_sectors() that are called in the SCSI disk
driver. You can see that they are called with an input of the Maximum
Write Same Length times (logical_block_size >> 9), which is to convert
the number of sectors to an appropriate value for the block layer
(512-byte block based). On 4Kn drives, the multiplier will be 4096 >>
9, which is 8. So if the reported Maximum Write Same Length is
4194240, when the value is passed onto the block layer to set the
limits for a 4Kn drive, it will be 4194240 * 8. (And when this value
is represented in bytes, it will be further multiplied by 512 and over
32-bit.)

`logical_block_size >> 9` is pretty much the same thing as
``logical_block_size / 512`. I should have probably used the bit shift
way instead.

On 11 August 2016 at 17:04, Shaun Tancheff <shaun@tancheff.com> wrote:
> On Thu, Aug 11, 2016 at 3:26 AM,  <tom.ty89@gmail.com> wrote:
>> From: Tom Yan <tom.ty89@gmail.com>
>>
>> Currently we advertise Maximum Write Same Length based on the
>> maximum number of sectors that one-block TRIM payload can cover.
>> The field are used to derived discard_max_bytes and
>> write_same_max_bytes limits in the block layer, which currently can
>> at max be 0xffffffff (32-bit).
>>
>> However, with a AF 4Kn drive, the derived limits would be 65535 *
>> 64 * 4096 = 0x3fffc0000 (34-bit). Therefore, we now devide
>> ATA_MAX_TRIM_RNUM with (logical sector size / 512), so that the
>> derived limits will not overflow.
>>
>> The limits are now also consistent among drives with different
>> logical sector sizes. (Although that may or may not be what we
>> want ultimately when the SCSI / block layer allows larger
>> representation in the future.)
>>
>> Although 4Kn ATA SSDs may not be a thing on the market yet, this
>> patch is necessary for forthcoming SCT Write Same translation
>> support, which could be available on traditional HDDs where 4Kn is
>> already a thing. Also it should not change the current behavior on
>> drives with 512-byte logical sectors.
>>
>> Note: this patch is not about AF 512e drives.
>> Signed-off-by: Tom Yan <tom.ty89@gmail.com>
>>
>> diff --git a/drivers/ata/libata-scsi.c b/drivers/ata/libata-scsi.c
>> index be9c76c..dcadcaf 100644
>> --- a/drivers/ata/libata-scsi.c
>> +++ b/drivers/ata/libata-scsi.c
>> @@ -2295,6 +2295,7 @@ static unsigned int ata_scsiop_inq_89(struct ata_scsi_args *args, u8 *rbuf)
>>  static unsigned int ata_scsiop_inq_b0(struct ata_scsi_args *args, u8 *rbuf)
>>  {
>>         u16 min_io_sectors;
>> +       u32 sector_size;
>>
>>         rbuf[1] = 0xb0;
>>         rbuf[3] = 0x3c;         /* required VPD size with unmap support */
>> @@ -2309,17 +2310,27 @@ static unsigned int ata_scsiop_inq_b0(struct ata_scsi_args *args, u8 *rbuf)
>>         min_io_sectors = 1 << ata_id_log2_per_physical_sector(args->id);
>>         put_unaligned_be16(min_io_sectors, &rbuf[6]);
>>
>> -       /*
>> -        * Optimal unmap granularity.
>> -        *
>> -        * The ATA spec doesn't even know about a granularity or alignment
>> -        * for the TRIM command.  We can leave away most of the unmap related
>> -        * VPD page entries, but we have specifify a granularity to signal
>> -        * that we support some form of unmap - in thise case via WRITE SAME
>> -        * with the unmap bit set.
>> -        */
>> +       sector_size = ata_id_logical_sector_size(args->id);
>>         if (ata_id_has_trim(args->id)) {
>> -               put_unaligned_be64(65535 * ATA_MAX_TRIM_RNUM, &rbuf[36]);
>> +               /*
>> +                * Maximum write same length.
>> +                *
>> +                * Avoid overflow in discard_max_bytes and write_same_max_bytes
>> +                * with AF 4Kn drives. Also make them consistent among drives
>> +                * with different logical sector sizes.
>> +                */
>> +               put_unaligned_be64(65535 * ATA_MAX_TRIM_RNUM /
>> +                                  (sector_size / 512), &rbuf[36]);
>
> I think the existing fixups in sd_setup_discard_cmnd() and
> sd_setup_write_same_cmnd()
> are 'doing the right thing'.
>
> If I understand the stack correctly:
>
> libata-scsi.c (and sd.c) both report a maximum in terms of 512 byte sectors.
> The upper layer stack works (mostly) on a mix of bytes and 512 byte sectors
> agnostic of the underlying hardware ... mostly. There are some bits in the
> files systems and block layer that are honoring the logical block size being
> larger 512 bytes as all I/O being generated are multiples of the logical block
> size as per block device's request_queue / queue_limits.
>
> So regardless of a 4Kn device being able to handle an 8x larger I/O as per
> the logical sector being bigger that's basically ignored, for convenience.
>
> In the scsi upper layer as the command are being setup the shift from
> 512 to 'sector_size' is handled to the number of device sectors is
> matched up to the request:
>
>     sector >>= ilog2(sdp->sector_size) - 9;
>     nr_sectors >>= ilog2(sdp->sector_size) - 9;
>
> So if you correctly report number of logical sectors here you break
> the 'fix' in sd.c
>
> At least that is my understanding.
>
>> +
>> +               /*
>> +                * Optimal unmap granularity.
>> +                *
>> +                * The ATA spec doesn't even know about a granularity or alignment
>> +                * for the TRIM command.  We can leave away most of the unmap related
>> +                * VPD page entries, but we have specifify a granularity to signal
>> +                * that we support some form of unmap - in thise case via WRITE SAME
>> +                * with the unmap bit set.
>> +                */
>>                 put_unaligned_be32(1, &rbuf[28]);
>>         }
>>
>> --
>> 2.9.2
>>
>
> Regards,
> Shaun
--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Martin K. Petersen Aug. 12, 2016, 1:34 a.m. UTC | #3
>>>>> "Tom," == tom ty89 <tom.ty89@gmail.com> writes:

Tom,

+		put_unaligned_be64(65535 * ATA_MAX_TRIM_RNUM /
+				   (sector_size / 512), &rbuf[36]);

MAXIMUM WRITE SAME LENGTH is in units of the device's logical block
size.
Tom Yan Aug. 12, 2016, 5:02 a.m. UTC | #4
On 12 August 2016 at 09:34, Martin K. Petersen
<martin.petersen@oracle.com> wrote:
>>>>>> "Tom," == tom ty89 <tom.ty89@gmail.com> writes:
>
> Tom,
>
> +               put_unaligned_be64(65535 * ATA_MAX_TRIM_RNUM /
> +                                  (sector_size / 512), &rbuf[36]);
>
> MAXIMUM WRITE SAME LENGTH is in units of the device's logical block
> size.

Yeah that's why:

+       sector_size = ata_id_logical_sector_size(args->id);

Where the logical sector size of the device is reported with the same
function in ata_scsi_dev_config() and ata_scsiop_read_cap().

So were you trying to pointing out something I am still missing, or
were you merely confirming I was right?

>
> --
> Martin K. Petersen      Oracle Linux Engineering
--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Martin K. Petersen Aug. 12, 2016, 8:56 p.m. UTC | #5
>>>>> "Tom" == Tom Yan <tom.ty89@gmail.com> writes:

Tom,

>> put_unaligned_be64(65535 * ATA_MAX_TRIM_RNUM / (sector_size / 512), &rbuf[36]);

How many 8-byte ranges fit in a 4096-byte sector?

Tom> So were you trying to pointing out something I am still missing, or
Tom> were you merely confirming I was right?

I suggest you drop ATA_MAX_TRIM_RNUM and do:

enum {
     ATA_TRIM_BLOCKS_PER_RANGE = 65535, /* 0xffff blocks per range desc. */
     ATA_TRIM_RANGE_SIZE_SHIFT = 3,     /* range descriptor is 8 bytes */
};

put_unaligned_be64(ATA_TRIM_BLOCKS_PER_RANGE *
		   sector_size >> ATA_TRIM_RANGE_SIZE_SHIFT, &rbuf[36]);

Might be worthwhile to create an ata_max_lba_range_blocks() wrapper.
Shaun Tancheff Aug. 12, 2016, 9:49 p.m. UTC | #6
On Fri, Aug 12, 2016 at 3:56 PM, Martin K. Petersen
<martin.petersen@oracle.com> wrote:
>>>>>> "Tom" == Tom Yan <tom.ty89@gmail.com> writes:
>
> Tom,
>
>>> put_unaligned_be64(65535 * ATA_MAX_TRIM_RNUM / (sector_size / 512), &rbuf[36]);
>
> How many 8-byte ranges fit in a 4096-byte sector?
>
> Tom> So were you trying to pointing out something I am still missing, or
> Tom> were you merely confirming I was right?
>
> I suggest you drop ATA_MAX_TRIM_RNUM and do:
>
> enum {
>      ATA_TRIM_BLOCKS_PER_RANGE = 65535, /* 0xffff blocks per range desc. */
>      ATA_TRIM_RANGE_SIZE_SHIFT = 3,     /* range descriptor is 8 bytes */
> };
>
> put_unaligned_be64(ATA_TRIM_BLOCKS_PER_RANGE *
>                    sector_size >> ATA_TRIM_RANGE_SIZE_SHIFT, &rbuf[36]);
>
> Might be worthwhile to create an ata_max_lba_range_blocks() wrapper.

Ah, I think I am understanding now. When the sector size is 4K the
minimum page sent with WRITE SAME will be 4K.

If so, we also need to fix the write_same SATL code that is working
under the assumption of a 512 byte sector sector as the largest
guaranteed amount of data in the associated sg pages.
Keying off of sector_size should be straight forward there...
Martin K. Petersen Aug. 12, 2016, 10:15 p.m. UTC | #7
>>>>> "Shaun" == Shaun Tancheff <shaun.tancheff@seagate.com> writes:

Shaun,

Shaun> Ah, I think I am understanding now. When the sector size is 4K
Shaun> the minimum page sent with WRITE SAME will be 4K.

Correct. We'll always transfer one logical block of data as required by
the spec.
Tom Yan Aug. 14, 2016, 9:23 a.m. UTC | #8
On 13 August 2016 at 04:56, Martin K. Petersen
<martin.petersen@oracle.com> wrote:
>>>>>> "Tom" == Tom Yan <tom.ty89@gmail.com> writes:
>
> Tom,
>
>>> put_unaligned_be64(65535 * ATA_MAX_TRIM_RNUM / (sector_size / 512), &rbuf[36]);
>
> How many 8-byte ranges fit in a 4096-byte sector?

The thing is, as of ACS-4, blocks that carry DSM/TRIM LBA Range
Entries are always 512-byte. Honestly, I have no idea how that would
work on a 4Kn SSD, if it is / will ever be a thing.

Anyway I was NOT trying to bump the number entries allowed, but
instead to decrease it accordingly for the increased sector size, only
to make sure that the Maximum Write Same Length advertised here is
gonna overflow the 32-bit limit when it is converted to a
representation in bytes (i.e. discard_max_bytes and
write_same_max_bytes), and probably also, the bio size.

It's also a suggestion from me for Shaun to determine how the Maximum
Write Same Length to be advertised, so that the size in bytes covered
by a single SCT Write Same will stay at ~2G in spite of the logical
sector size.

>
> Tom> So were you trying to pointing out something I am still missing, or
> Tom> were you merely confirming I was right?
>
> I suggest you drop ATA_MAX_TRIM_RNUM and do:
>
> enum {
>      ATA_TRIM_BLOCKS_PER_RANGE = 65535, /* 0xffff blocks per range desc. */
>      ATA_TRIM_RANGE_SIZE_SHIFT = 3,     /* range descriptor is 8 bytes */
> };
>
> put_unaligned_be64(ATA_TRIM_BLOCKS_PER_RANGE *
>                    sector_size >> ATA_TRIM_RANGE_SIZE_SHIFT, &rbuf[36]);
>
> Might be worthwhile to create an ata_max_lba_range_blocks() wrapper.
>
> --
> Martin K. Petersen      Oracle Linux Engineering
--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Martin K. Petersen Aug. 16, 2016, 4:15 a.m. UTC | #9
>>>>> "Tom" == Tom Yan <tom.ty89@gmail.com> writes:

Tom,

Tom> The thing is, as of ACS-4, blocks that carry DSM/TRIM LBA Range
Tom> Entries are always 512-byte.

Lovely. And SAT conveniently ignores this entirely.

Tom> Honestly, I have no idea how that would work on a 4Kn SSD, if it is
Tom> / will ever be a thing.

Highly unlikely.

But I guess you would have to report 8 512-byte ranges in the 4Kn
case. And then rely on the patch we talked about to clamp the number of
sectors based on the block layer limit.
diff mbox

Patch

diff --git a/drivers/ata/libata-scsi.c b/drivers/ata/libata-scsi.c
index be9c76c..dcadcaf 100644
--- a/drivers/ata/libata-scsi.c
+++ b/drivers/ata/libata-scsi.c
@@ -2295,6 +2295,7 @@  static unsigned int ata_scsiop_inq_89(struct ata_scsi_args *args, u8 *rbuf)
 static unsigned int ata_scsiop_inq_b0(struct ata_scsi_args *args, u8 *rbuf)
 {
 	u16 min_io_sectors;
+	u32 sector_size;
 
 	rbuf[1] = 0xb0;
 	rbuf[3] = 0x3c;		/* required VPD size with unmap support */
@@ -2309,17 +2310,27 @@  static unsigned int ata_scsiop_inq_b0(struct ata_scsi_args *args, u8 *rbuf)
 	min_io_sectors = 1 << ata_id_log2_per_physical_sector(args->id);
 	put_unaligned_be16(min_io_sectors, &rbuf[6]);
 
-	/*
-	 * Optimal unmap granularity.
-	 *
-	 * The ATA spec doesn't even know about a granularity or alignment
-	 * for the TRIM command.  We can leave away most of the unmap related
-	 * VPD page entries, but we have specifify a granularity to signal
-	 * that we support some form of unmap - in thise case via WRITE SAME
-	 * with the unmap bit set.
-	 */
+	sector_size = ata_id_logical_sector_size(args->id);
 	if (ata_id_has_trim(args->id)) {
-		put_unaligned_be64(65535 * ATA_MAX_TRIM_RNUM, &rbuf[36]);
+		/*
+		 * Maximum write same length.
+		 *
+		 * Avoid overflow in discard_max_bytes and write_same_max_bytes
+		 * with AF 4Kn drives. Also make them consistent among drives
+		 * with different logical sector sizes.
+		 */
+		put_unaligned_be64(65535 * ATA_MAX_TRIM_RNUM /
+				   (sector_size / 512), &rbuf[36]);
+
+		/*
+		 * Optimal unmap granularity.
+		 *
+		 * The ATA spec doesn't even know about a granularity or alignment
+		 * for the TRIM command.  We can leave away most of the unmap related
+		 * VPD page entries, but we have specifify a granularity to signal
+		 * that we support some form of unmap - in thise case via WRITE SAME
+		 * with the unmap bit set.
+		 */
 		put_unaligned_be32(1, &rbuf[28]);
 	}