diff mbox series

sd: disable logical block provisioning if 'lbpme' is not set

Message ID 20190214221558.09174756@endymion (mailing list archive)
State Changes Requested
Headers show
Series sd: disable logical block provisioning if 'lbpme' is not set | expand

Commit Message

Jean Delvare Feb. 14, 2019, 9:15 p.m. UTC
From: Hannes Reinecke <hare@suse.com>

When evaluating the 'block limits' VPD page we need to check if
the 'lbpme' (logical block provisioning management enable) bit
is set in the READ CAPACITY (16) output.
If it isn't we can safely assume that we cannot use DISCARD on
this device.

[JD: forward-ported to kernel v4.20]

Signed-off-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jean Delvare <jdelvare@suse.com>
---
Hannes, please double-check that my forward-port is correct.

 drivers/scsi/sd.c |   11 ++++++++++-
 1 file changed, 10 insertions(+), 1 deletion(-)

Comments

Bart Van Assche Feb. 15, 2019, 3:56 p.m. UTC | #1
On Thu, 2019-02-14 at 22:15 +0100, Jean Delvare wrote:
> From: Hannes Reinecke <hare@suse.com>
> 
> When evaluating the 'block limits' VPD page we need to check if
> the 'lbpme' (logical block provisioning management enable) bit
> is set in the READ CAPACITY (16) output.
> If it isn't we can safely assume that we cannot use DISCARD on
> this device.
> 
> [JD: forward-ported to kernel v4.20]
> 
> Signed-off-by: Hannes Reinecke <hare@suse.com>
> Signed-off-by: Jean Delvare <jdelvare@suse.com>
> ---
> Hannes, please double-check that my forward-port is correct.
> 
>  drivers/scsi/sd.c |   11 ++++++++++-
>  1 file changed, 10 insertions(+), 1 deletion(-)
> 
> --- a/drivers/scsi/sd.c
> +++ b/drivers/scsi/sd.c
> @@ -411,6 +411,13 @@ provisioning_mode_store(struct device *d
>  	if (mode < 0)
>  		return -EINVAL;
>  
> +	/*
> +	 * If logical block provisioning isn't enabled we can only
> +	 * select 'disable' here.
> +	 */
> +	if (!sdkp->lbpme && mode != SD_LBP_DISABLE)
> +		return -EINVAL;
> +
>  	sd_config_discard(sdkp, mode);
>  
>  	return count;
> @@ -2942,8 +2949,10 @@ static void sd_read_block_limits(struct
>  
>  		sdkp->max_ws_blocks = (u32)get_unaligned_be64(&buffer[36]);
>  
> -		if (!sdkp->lbpme)
> +		if (!sdkp->lbpme) {
> +			sd_config_discard(sdkp, SD_LBP_DISABLE);
>  			goto out;
> +		}
>  
>  		lba_count = get_unaligned_be32(&buffer[20]);
>  		desc_count = get_unaligned_be32(&buffer[24]);

What is the impact of this patch on SATA SSDs? Since these SSDs do not support
logical provisioning, does this patch break trim support for these SSDs?

Bart.
Martin K. Petersen Feb. 16, 2019, 3:30 a.m. UTC | #2
Jean,

> When evaluating the 'block limits' VPD page we need to check if the
> 'lbpme' (logical block provisioning management enable) bit is set in
> the READ CAPACITY (16) output.  If it isn't we can safely assume that
> we cannot use DISCARD on this device.

That's not entirely correct. We still support devices which predate
logical block provisioning getting standardized in SBC. These devices
are typically driven by the zeroing method and don't have LBPME, nor the
LBP VPD page to key off of. Instead they are triggered by setting the
provisioning_mode via udev.

I am not sure how many of these are still around, I am hoping very
few. But I'd prefer to not break anything.

Can you describe the specific problem your patch is aiming to address?
Jean Delvare Feb. 18, 2019, 9:56 a.m. UTC | #3
Hi Bart,

Sorry to resend this, my previous reply had a stray backslash in the
headers (no idea why) that confused certain MTA, so at least Hannes did
not receive it.

I take benefit of this resend to update with the latest information, so
keep reading.

On Fri, 15 Feb 2019 07:56:11 -0800, Bart Van Assche wrote:
> On Thu, 2019-02-14 at 22:15 +0100, Jean Delvare wrote:
> > From: Hannes Reinecke <hare@suse.com>
> > 
> > When evaluating the 'block limits' VPD page we need to check if
> > the 'lbpme' (logical block provisioning management enable) bit
> > is set in the READ CAPACITY (16) output.
> > If it isn't we can safely assume that we cannot use DISCARD on
> > this device.
> > 
> > [JD: forward-ported to kernel v4.20]
> > 
> > Signed-off-by: Hannes Reinecke <hare@suse.com>
> > Signed-off-by: Jean Delvare <jdelvare@suse.com>
> > ---
> > Hannes, please double-check that my forward-port is correct.
> > 
> >  drivers/scsi/sd.c |   11 ++++++++++-
> >  1 file changed, 10 insertions(+), 1 deletion(-)
> > 
> > --- a/drivers/scsi/sd.c
> > +++ b/drivers/scsi/sd.c
> > @@ -411,6 +411,13 @@ provisioning_mode_store(struct device *d
> >  	if (mode < 0)
> >  		return -EINVAL;
> >  
> > +	/*
> > +	 * If logical block provisioning isn't enabled we can only
> > +	 * select 'disable' here.
> > +	 */
> > +	if (!sdkp->lbpme && mode != SD_LBP_DISABLE)
> > +		return -EINVAL;
> > +
> >  	sd_config_discard(sdkp, mode);
> >  
> >  	return count;
> > @@ -2942,8 +2949,10 @@ static void sd_read_block_limits(struct
> >  
> >  		sdkp->max_ws_blocks = (u32)get_unaligned_be64(&buffer[36]);
> >  
> > -		if (!sdkp->lbpme)
> > +		if (!sdkp->lbpme) {
> > +			sd_config_discard(sdkp, SD_LBP_DISABLE);
> >  			goto out;
> > +		}
> >  
> >  		lba_count = get_unaligned_be32(&buffer[20]);
> >  		desc_count = get_unaligned_be32(&buffer[24]);  
> 
> What is the impact of this patch on SATA SSDs? Since these SSDs do not support
> logical provisioning, does this patch break trim support for these SSDs?

Thanks for your feedback. It should be noted that I am only the
messenger here, not the author of the patch. I am only trying to ensure
that upstream is not missing a fix that we have in our SUSE kernels. My
knowledge about the sd driver or storage in general is very thin, but
I'll do my best to answer questions and address issues. Any help from
the experts will be appreciated.

What makes you think that SATA SSDs do not support logical
provisioning? I added log messages to the sd driver and booted the
instrumented driver on my workstation, which has 1 SATA SSD and 2 SATA
HHD. sd_read_block_limits is called 3 times for each (not sure why).
For the SSD, sdkp->lbpme=1. For the HDD, sdkp->lbpme=0. So I think the
code should do the right thing for SATA SSD?

[    1.351917] scsi 0:0:0:0: Direct-Access     ATA      Samsung SSD 850  2B6Q PQ: 0 ANSI: 5
[    1.352975] scsi_device 0:0:0:0: sd_read_block_limits called, sdkp->lbpme=1

[    1.366916] scsi 1:0:0:0: Direct-Access     ATA      ST2000VM003-1ET1 SC12 PQ: 0 ANSI: 5
[    1.367669] scsi_device 1:0:0:0: sd_read_block_limits called, sdkp->lbpme=0

[    1.391266] scsi 3:0:0:0: Direct-Access     ATA      ST1000DM003-1ER1 CC45 PQ: 0 ANSI: 5
[    1.391854] scsi_device 3:0:0:0: sd_read_block_limits called, sdkp->lbpme=0

Of course that's only one sample. Meanwhile I noticed that sdkp->lbpme
can be read from the "thin_provisioning" sysfs attribute, without
instrumenting the kernel. So I checked on my father's system and wife's
system, and both an INTEL SSDSA2CT04 (2011) and SAMSUNG SSD 830 (2013)
have sdkp->lbpme=1. So SATA SSD seems to be safe.

One thing that worries me more is that SATA HDD have provisioning_mode
set to "full" (SD_LBP_FULL) by default, while Hannes' patch only allows
writing "disabled" (SD_LBP_DISABLE) to this sysfs attribute file when
sdkp->lbpme=0. If SD_LBP_FULL was valid at boot time, then writing that
value back should certainly be allowed. Hannes?
Hannes Reinecke Feb. 18, 2019, 10:47 a.m. UTC | #4
On 2/18/19 10:56 AM, Jean Delvare wrote:
> Hi Bart,
> 
> Sorry to resend this, my previous reply had a stray backslash in the
> headers (no idea why) that confused certain MTA, so at least Hannes did
> not receive it.
> 
> I take benefit of this resend to update with the latest information, so
> keep reading.
> 
> On Fri, 15 Feb 2019 07:56:11 -0800, Bart Van Assche wrote:
>> On Thu, 2019-02-14 at 22:15 +0100, Jean Delvare wrote:
>>> From: Hannes Reinecke <hare@suse.com>
>>>
>>> When evaluating the 'block limits' VPD page we need to check if
>>> the 'lbpme' (logical block provisioning management enable) bit
>>> is set in the READ CAPACITY (16) output.
>>> If it isn't we can safely assume that we cannot use DISCARD on
>>> this device.
>>>
>>> [JD: forward-ported to kernel v4.20]
>>>
>>> Signed-off-by: Hannes Reinecke <hare@suse.com>
>>> Signed-off-by: Jean Delvare <jdelvare@suse.com>
>>> ---
>>> Hannes, please double-check that my forward-port is correct.
>>>
>>>   drivers/scsi/sd.c |   11 ++++++++++-
>>>   1 file changed, 10 insertions(+), 1 deletion(-)
>>>
>>> --- a/drivers/scsi/sd.c
>>> +++ b/drivers/scsi/sd.c
>>> @@ -411,6 +411,13 @@ provisioning_mode_store(struct device *d
>>>   	if (mode < 0)
>>>   		return -EINVAL;
>>>   
>>> +	/*
>>> +	 * If logical block provisioning isn't enabled we can only
>>> +	 * select 'disable' here.
>>> +	 */
>>> +	if (!sdkp->lbpme && mode != SD_LBP_DISABLE)
>>> +		return -EINVAL;
>>> +
>>>   	sd_config_discard(sdkp, mode);
>>>   
>>>   	return count;
>>> @@ -2942,8 +2949,10 @@ static void sd_read_block_limits(struct
>>>   
>>>   		sdkp->max_ws_blocks = (u32)get_unaligned_be64(&buffer[36]);
>>>   
>>> -		if (!sdkp->lbpme)
>>> +		if (!sdkp->lbpme) {
>>> +			sd_config_discard(sdkp, SD_LBP_DISABLE);
>>>   			goto out;
>>> +		}
>>>   
>>>   		lba_count = get_unaligned_be32(&buffer[20]);
>>>   		desc_count = get_unaligned_be32(&buffer[24]);
>>
>> What is the impact of this patch on SATA SSDs? Since these SSDs do not support
>> logical provisioning, does this patch break trim support for these SSDs?
> 
None.

SATA SSDs as provided by libata will always set the 'lbpme' bit if TRIM 
is supported:

		if (ata_id_has_trim(args->id) &&
		    !(dev->horkage & ATA_HORKAGE_NOTRIM)) {
			rbuf[14] |= 0x80; /* LBPME */

			if (ata_id_has_zero_after_trim(args->id) &&
			    dev->horkage & ATA_HORKAGE_ZERO_AFTER_TRIM) {
				ata_dev_info(dev, "Enabling discard_zeroes_data\n");
				rbuf[14] |= 0x40; /* LBPRZ */
			}
		}

Of course, I cannot speak for other SATL implementations (mpt3sas etc), 
but they are known to have other issues, too, so I don't really care.

Cheers,

Hannes
Jean Delvare Feb. 18, 2019, 10:48 a.m. UTC | #5
Hi Martin,

On Fri, 2019-02-15 at 22:30 -0500, Martin K. Petersen wrote:
> When evaluating the 'block limits' VPD page we need to check if the
> > 'lbpme' (logical block provisioning management enable) bit is set in
> > the READ CAPACITY (16) output.  If it isn't we can safely assume that
> > we cannot use DISCARD on this device.
> 
> That's not entirely correct. We still support devices which predate
> logical block provisioning getting standardized in SBC. These devices
> are typically driven by the zeroing method and don't have LBPME, nor the
> LBP VPD page to key off of. Instead they are triggered by setting the
> provisioning_mode via udev.
> 
> I am not sure how many of these are still around, I am hoping very
> few. But I'd prefer to not break anything.

So basically the condition "if (!sdkp->lbpme)" should be replaced by
something like "if (LBP_VPD_page_is_supported && !sdkp->lbpme)"?

Unfortunately it does not look like the driver keeps memory of
successfully executing read_capacity_16(). do we need a new flag in
sdkp to remember read_capacity_16() success? Or can we just test
sdkp->lbpvpd?

> Can you describe the specific problem your patch is aiming to address?

The problem that was reported to us by several customers and which this
patch attempts to address is file system corruption. Specifically
metadata corruption reported on XFS filesystems by
xfs_agf_read_verify(). Investigation revealed that the problematic
areas had been zeroed-out at the block device level. This was in Azure
VM, using dm-stripe, with an fstrim service running automatically on a
weekly basis (default SUSE setting, in case the system has one or more
SSD).

The details are beyond me to be honest, but my limited understanding is
that calling TRIM on virtual block devices for which TRIM makes no
sense ended up zeroing storage areas and causing the corruption. Hannes
or Jeff may be able to explain it better, hopefully.
Jeff Mahoney Feb. 19, 2019, 3:07 p.m. UTC | #6
On 2/18/19 5:48 AM, Jean Delvare wrote:
> Hi Martin,
> 
> On Fri, 2019-02-15 at 22:30 -0500, Martin K. Petersen wrote:
>> When evaluating the 'block limits' VPD page we need to check if the
>>> 'lbpme' (logical block provisioning management enable) bit is set in
>>> the READ CAPACITY (16) output.  If it isn't we can safely assume that
>>> we cannot use DISCARD on this device.
>>
>> That's not entirely correct. We still support devices which predate
>> logical block provisioning getting standardized in SBC. These devices
>> are typically driven by the zeroing method and don't have LBPME, nor the
>> LBP VPD page to key off of. Instead they are triggered by setting the
>> provisioning_mode via udev.
>>
>> I am not sure how many of these are still around, I am hoping very
>> few. But I'd prefer to not break anything.
> 
> So basically the condition "if (!sdkp->lbpme)" should be replaced by
> something like "if (LBP_VPD_page_is_supported && !sdkp->lbpme)"?
> 
> Unfortunately it does not look like the driver keeps memory of
> successfully executing read_capacity_16(). do we need a new flag in
> sdkp to remember read_capacity_16() success? Or can we just test
> sdkp->lbpvpd?
> 
>> Can you describe the specific problem your patch is aiming to address?
> 
> The problem that was reported to us by several customers and which this
> patch attempts to address is file system corruption. Specifically
> metadata corruption reported on XFS filesystems by
> xfs_agf_read_verify(). Investigation revealed that the problematic
> areas had been zeroed-out at the block device level. This was in Azure
> VM, using dm-stripe, with an fstrim service running automatically on a
> weekly basis (default SUSE setting, in case the system has one or more
> SSD).
> 
> The details are beyond me to be honest, but my limited understanding is
> that calling TRIM on virtual block devices for which TRIM makes no
> sense ended up zeroing storage areas and causing the corruption. Hannes
> or Jeff may be able to explain it better, hopefully.
> 

That's more or less how the problem presents itself, but it's not that
we issue an fstrim on a block and it gets zeroed.  That would be fine
from the file system perspective.  It's that the fstrim is issued and
unrelated blocks are zeroed.  Once I saw the storage layer was
responsible, I stopped digging and passed it to Hannes.

-Jeff
Bart Van Assche Feb. 19, 2019, 4:50 p.m. UTC | #7
On Tue, 2019-02-19 at 10:07 -0500, Jeff Mahoney wrote:
> That's more or less how the problem presents itself, but it's not that
> we issue an fstrim on a block and it gets zeroed.  That would be fine
> from the file system perspective.  It's that the fstrim is issued and
> unrelated blocks are zeroed.  Once I saw the storage layer was
> responsible, I stopped digging and passed it to Hannes.

How can issuing an fstrim cause unrelated blocks to be zeroed? I think
the root cause needs to be identified before we continue with this patch.

Thanks,

Bart.
Martin K. Petersen Feb. 20, 2019, 12:35 a.m. UTC | #8
Jean,

> So basically the condition "if (!sdkp->lbpme)" should be replaced by
> something like "if (LBP_VPD_page_is_supported && !sdkp->lbpme)"?

We can not depend on the VPD page here. It was added to the spec long
after the LBPME flag.
Martin K. Petersen Feb. 20, 2019, 1:14 a.m. UTC | #9
Jeff,

> That's more or less how the problem presents itself, but it's not that
> we issue an fstrim on a block and it gets zeroed.  That would be fine
> from the file system perspective.  It's that the fstrim is issued and
> unrelated blocks are zeroed.  Once I saw the storage layer was
> responsible, I stopped digging and passed it to Hannes.

--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -411,6 +411,13 @@ provisioning_mode_store(struct device *d
	if (mode < 0)
		return -EINVAL;

+	/*
+	 * If logical block provisioning isn't enabled we can only
+	 * select 'disable' here.
+	 */
+	if (!sdkp->lbpme && mode != SD_LBP_DISABLE)
+		return -EINVAL;
+
	sd_config_discard(sdkp, mode);

	return count;

This first hunk disables the ability for the user to set the
provisioning_mode to SD_LBP_ZERO. Zero mode is only triggered through
sysfs and is never set by the kernel since there is no way we can detect
if a device is capable. Zero mode predates any of the thin provisioning
knobs being added to the SCSI protocol. This mode is how enterprise
arrays used to work, and is akin to zero block detection on modern SSDs.

@@ -2942,8 +2949,10 @@ static void sd_read_block_limits(struct

		sdkp->max_ws_blocks = (u32)get_unaligned_be64(&buffer[36]);

-		if (!sdkp->lbpme)
+		if (!sdkp->lbpme) {
+			sd_config_discard(sdkp, SD_LBP_DISABLE);
			goto out;
+		}

		lba_count = get_unaligned_be32(&buffer[20]);
		desc_count = get_unaligned_be32(&buffer[24]);

The only place we ever set lbpme (and thus enable discard) is if the
device sets the LBPME field in READ CAPACITY(16). So there should not be
any way for this code to do anything. If sdkp->lbpme is unset, discard
is already disabled (unless the device temporarily reports lbpme and
then switches it off!?).

Under the assumption that this patch actually fixed the problem, it
smells like there was a udev or user script at play here that twiddled
the provisioning mode in sysfs.

I concur that the discard detection logic is gnarly because we support
several iterations of what turned out to be a constantly moving target
in the SCSI spec. But I don't see how the kernel would go down the path
of automatically enabling discards unless the device reported LBPME in
the first place.

Definitely need more data.
Martin K. Petersen Feb. 20, 2019, 1:16 a.m. UTC | #10
Bart,

> How can issuing an fstrim cause unrelated blocks to be zeroed? I think
> the root cause needs to be identified before we continue with this
> patch.

Absolutely!
diff mbox series

Patch

--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -411,6 +411,13 @@  provisioning_mode_store(struct device *d
 	if (mode < 0)
 		return -EINVAL;
 
+	/*
+	 * If logical block provisioning isn't enabled we can only
+	 * select 'disable' here.
+	 */
+	if (!sdkp->lbpme && mode != SD_LBP_DISABLE)
+		return -EINVAL;
+
 	sd_config_discard(sdkp, mode);
 
 	return count;
@@ -2942,8 +2949,10 @@  static void sd_read_block_limits(struct
 
 		sdkp->max_ws_blocks = (u32)get_unaligned_be64(&buffer[36]);
 
-		if (!sdkp->lbpme)
+		if (!sdkp->lbpme) {
+			sd_config_discard(sdkp, SD_LBP_DISABLE);
 			goto out;
+		}
 
 		lba_count = get_unaligned_be32(&buffer[20]);
 		desc_count = get_unaligned_be32(&buffer[24]);