[02/14] scsi: core: Query VPD size before getting full page

Message ID	20220302053559.32147-3-martin.petersen@oracle.com (mailing list archive)
State	Changes Requested
Headers	show Return-Path: <linux-scsi-owner@kernel.org> From: "Martin K. Petersen" <martin.petersen@oracle.com> To: linux-scsi@vger.kernel.org Cc: "Martin K. Petersen" <martin.petersen@oracle.com>, "Maciej W . Rozycki" <macro@orcam.me.uk> Subject: [PATCH 02/14] scsi: core: Query VPD size before getting full page Date: Wed, 2 Mar 2022 00:35:47 -0500 Message-Id: <20220302053559.32147-3-martin.petersen@oracle.com> In-Reply-To: <20220302053559.32147-1-martin.petersen@oracle.com> References: <20220302053559.32147-1-martin.petersen@oracle.com> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Precedence: bulk
Series	[01/14] scsi: mpt3sas: Use cached ATA Information VPD page \| expand [01/14] scsi: mpt3sas: Use cached ATA Information VPD page [02/14] scsi: core: Query VPD size before getting full page [03/14] scsi: core: Do not truncate INQUIRY data on modern devices [04/14] scsi: core: Pick suitable allocation length in scsi_report_opcode() [05/14] scsi: core: Cache VPD pages b0, b1, b2 [06/14] scsi: sd: Use cached ATA Information VPD page [07/14] scsi: sd: Switch to using scsi_device VPD pages [08/14] scsi: sd: Optimal I/O size should be a multiple of reported granularity [09/14] scsi: sd: Fix discard errors during revalidate [10/14] scsi: sd: Move WRITE_ZEROES configuration to a separate function [11/14] scsi: sd: Implement support for NDOB flag in WRITE SAME(16) [12/14] scsi: sd: sd_read_cpr() requires VPD pages [13/14] scsi: sd: Reorganize DIF/DIX code to avoid calling revalidate twice [14/14] scsi: sd: Enable modern protocol features on more devices

Martin K. Petersen March 2, 2022, 5:35 a.m. UTC

We currently default to 255 bytes when fetching VPD pages during discovery.
However, we have had a few devices that are known to wedge if the requested
buffer exceeds a certain size. See commit af73623f5f10 ("[SCSI] sd: Reduce
buffer size for vpd request") which works around one example of this
problem in the SCSI disk driver.

With commit d188b0675b21 ("scsi: core: Add sysfs attributes for VPD pages
0h and 89h") we now risk triggering the same issue in the generic midlayer
code.

The problem with the ATA VPD page in particular is that the SCSI portion of
the page is trailed by 512 bytes of verbatim ATA Identify Device
information.  However, not all controllers actually provide the additional
512 bytes and will lock up if one asks for more than the 64 bytes
containing the SCSI protocol fields.

Instead of picking a new, somewhat arbitrary, number of bytes for the VPD
buffer size, start fetching the 4-byte header for each page. The header
contains the size of the page as far as the device is concerned. We can use
the reported size to specify the correct allocation length when
subsequently fetching the full page.

The header validation is done by a new helper function scsi_get_vpd_size()
and both scsi_get_vpd_page() and scsi_get_vpd_buf() now rely on this to
query the page size.

In addition, scsi_get_vpd_page() is simplified to mirror the logic in
scsi_get_vpd_page(). This involves removing the Supported VPD Pages lookup
prior to attempting to query a page. There does not appear any evidence,
even in the oldest SCSI specs, that this step is required. We already rely
on scsi_get_vpd_page() throughout the stack and this function never
consulted the Supported VPD Pages. Since this has not caused any problems
it should be safe to remove the precondition from scsi_get_vpd_page().

Instrumented runs also revealed that the Supported VPD Pages lookup had
little effect since the device page index often was larger than the
supplied buffer size. As a result, inquiries frequently bypassed the index
check and went through the "If we ran off the end of the buffer, give us
the benefit of the doubt" code path which assumed the page was present
despite not being listed. The revised code takes both the page size
reported by the device as well as the size of the buffer provided by the
scsi_get_vpd_page() caller into account.

Fixes: d188b0675b21 ("scsi: core: Add sysfs attributes for VPD pages 0h and 89h")
Reported-by: Maciej W. Rozycki <macro@orcam.me.uk>
Tested-by: Maciej W. Rozycki <macro@orcam.me.uk>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
---
 drivers/scsi/scsi.c        | 94 +++++++++++++++++++++++++-------------
 include/scsi/scsi_device.h |  5 +-
 2 files changed, 67 insertions(+), 32 deletions(-)

Christoph Hellwig March 2, 2022, 9:48 a.m. UTC | #1

Looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>

Johannes Thumshirn March 2, 2022, 2:25 p.m. UTC | #2

Looks good,
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>

Two minior nits below:

> +	result = scsi_vpd_inquiry(sdev, vpd_header, page, sizeof(vpd_header));
> +
> +	if (result < 0)
> +		return 0;

can we remove the empty line?

> +	memset(buf, 0, buf_len);
> +	result = scsi_vpd_inquiry(sdev, buf, page, vpd_len);
>  	if (result < 0)
> -		goto fail;
> +		return -EINVAL;
> +	else if (result > vpd_len) {
> +		dev_warn_once(&sdev->sdev_gendev,
> +			      "%s: VPD page 0x%02x result %d > %d bytes\n",
> +			      __func__, page, result, vpd_len);
> +		vpd_len = min(result, buf_len);
> +		goto retry_pg;
> +	}
>  

Adding {} to the if block as well makes it look nicer IMHO

Bart Van Assche March 3, 2022, 12:30 a.m. UTC | #3

On 3/1/22 21:35, Martin K. Petersen wrote:
> +static int scsi_get_vpd_size(struct scsi_device *sdev, u8 page)
> +{
> +	unsigned char vpd_header[SCSI_VPD_HEADER_SIZE];
> +	int result;

The SCSI core sets the minimum DMA alignment to 4 bytes. How about 
aligning the vpd_header[] array explicitly to a four-byte boundary such 
that the block layer does not have to allocate a temporary buffer?

> +	vpd_len = min(vpd_len, buf_len);
>   
> - found:
> -	result = scsi_vpd_inquiry(sdev, buf, page, buf_len);
> +retry_pg:
> +	/*
> +	 * Fetch the actual page. Since the appropriate size was reported
> +	 * by the device it is now safe to ask for something bigger.
> +	 */
> +	memset(buf, 0, buf_len);
> +	result = scsi_vpd_inquiry(sdev, buf, page, vpd_len);
>   	if (result < 0)
> -		goto fail;
> +		return -EINVAL;
> +	else if (result > vpd_len) {
> +		dev_warn_once(&sdev->sdev_gendev,
> +			      "%s: VPD page 0x%02x result %d > %d bytes\n",
> +			      __func__, page, result, vpd_len);
> +		vpd_len = min(result, buf_len);
> +		goto retry_pg;
> +	}

Will an endless loop be triggered if the VPD page length is larger than 
'buf_len'?

Thanks,

Bart.

Martin K. Petersen March 4, 2022, 3:28 a.m. UTC | #4

Bart,

>> +	vpd_len = min(vpd_len, buf_len);
>>   - found:
>> -	result = scsi_vpd_inquiry(sdev, buf, page, buf_len);
>> +retry_pg:
>> +	/*
>> +	 * Fetch the actual page. Since the appropriate size was reported
>> +	 * by the device it is now safe to ask for something bigger.
>> +	 */
>> +	memset(buf, 0, buf_len);
>> +	result = scsi_vpd_inquiry(sdev, buf, page, vpd_len);
>>   	if (result < 0)
>> -		goto fail;
>> +		return -EINVAL;
>> +	else if (result > vpd_len) {
>> +		dev_warn_once(&sdev->sdev_gendev,
>> +			      "%s: VPD page 0x%02x result %d > %d bytes\n",
>> +			      __func__, page, result, vpd_len);
>> +		vpd_len = min(result, buf_len);
>> +		goto retry_pg;
>> +	}
>
> Will an endless loop be triggered if the VPD page length is larger
> than 'buf_len'?

Ah, transplant thinko from scsi_get_vpd_buf() which reallocates the
buffer on mismatch. Will fix.

Thanks!

Martin K. Petersen March 4, 2022, 3:42 a.m. UTC | #5

Johannes,

> Two minior nits below:
>
>> +	result = scsi_vpd_inquiry(sdev, vpd_header, page, sizeof(vpd_header));
>> +
>> +	if (result < 0)
>> +		return 0;
>
> can we remove the empty line?

Sure!

>> +	memset(buf, 0, buf_len);
>> +	result = scsi_vpd_inquiry(sdev, buf, page, vpd_len);
>>  	if (result < 0)
>> -		goto fail;
>> +		return -EINVAL;
>> +	else if (result > vpd_len) {
>> +		dev_warn_once(&sdev->sdev_gendev,
>> +			      "%s: VPD page 0x%02x result %d > %d bytes\n",
>> +			      __func__, page, result, vpd_len);
>> +		vpd_len = min(result, buf_len);
>> +		goto retry_pg;
>> +	}
>>  
>
> Adding {} to the if block as well makes it look nicer IMHO

Roger.

Aaron Lauterer June 4, 2024, 12:12 p.m. UTC | #6

Hi!

On a customer system, we found that after commit

  c92a6b5d63359dd6d2ce6ea88ecd8e31dd769f6b (scsi: core: Query VPD size 
before getting full page)

`qemu-img convert` writing to a certain HDD would be very slow when zero 
blocks are involved. What the commit seems to have changed is make it 
possible to issue `fallocate` calls with 
"FALLOC_FL_KEEP_SIZE|FALLOC_FL_PUNCH_HOLE" flags.
Without the `fallocate` support, it would use an alternative method to 
write out zeroes that is considerably faster.

The question is if allowing `fallocate` in this case is an intentional 
effect of commit c92a6b5d?

The target is an HDD behind a MegaRAID controller configured as JBOD. We 
see the same behavior if LVM is used on top of the disk.

These calls are rather slow on this system, especially when the offset 
is not aligned to 4k.

2 MiB fallocate 4k aligned:
---
root@pve1:~# time fallocate --punch-hole --keep-size -l 2M /dev/sda 
--offset 4M

real    0m0.030s
user    0m0.001s
sys     0m0.000s
---

2 MiB fallocate at 4 MiB - 512 byte:
---
root@pve1:~# time fallocate --punch-hole --keep-size -l 2M /dev/sda 
--offset 4294966784

real    0m0.707s
user    0m0.000s
sys     0m0.001s
---

In our case, `qemu-img convert` introduces a misalignment of -512 bytes 
every 2 GiB because it handles up to INT_MAX (2^31-1) 512 byte sectors 
at a time.
Improving the alignment in `qemu-img convert` is another approach we are 
considering.

During boot (no matter which of the tested kernel versions) we see the 
following log line:
---
Jun 03 15:28:32 pve1 kernel: pci 0000:01:00.0: [Firmware Bug]: disabling 
  VPD access (can't determine size of non-standard VPD format)
---

That PCI address is the controller to which the disk is connected:
---
root@pve1:~# lspci -nnk -s 01:00.0
01:00.0 RAID bus controller [0104]: Broadcom / LSI MegaRAID SAS-3 3008 
[Fury] [1000:005f] (rev 02)
         Subsystem: Broadcom / LSI MegaRAID SAS-3 3008 [Fury] [1000:9343]
         Kernel driver in use: megaraid_sas
         Kernel modules: megaraid_sas
---
The system in question is using a MegaRAID 9341-4i controller with the 
following firmware versions:
FW Package Build = 24.21.0-0159
BIOS Version = 6.36.00.3_4.19.08.00_0x06180206
FW Version = 4.680.01-8576


A test with kernel 6.10-rc2 did show that fallocate support is still 
detected.

I found the following other bug report that looks similar [0] from 
February. But the mentioned patch [1] (if this is the correct one) seems 
to be already included in recent kernel versions.

The sg_vpd info, as was requested in [0] for this system follows. [2]
We checked the output of `sg_vpd -a /dev/sda` and the output does not 
differ between a kernel that allows fallocate and one that doesn't.

If you need additional information or tests, please let me know. I have 
access to the affected system.

Thanks and kind regards,
Aaron

[0] 
https://lore.kernel.org/all/20240210011831.47f55oe67utq2yr7@altlinux.org/T/#mb37c7d063c07d356d11111231fac4c048d7678e5

[1] 
https://lore.kernel.org/linux-scsi/20240214221411.2888112-1-martin.petersen@oracle.com/

[2]
root@pve1:~# sg_readcap -l /dev/sda
Read Capacity results:
    Protection: prot_en=0, p_type=0, p_i_exponent=0
    Logical block provisioning: lbpme=0, lbprz=0
    Last LBA=7814037167 (0x1d1c0beaf), Number of logical blocks=7814037168
    Logical block length=512 bytes
    Logical blocks per physical block exponent=3 [so physical block 
length=4096 bytes]
    Lowest aligned LBA=0
Hence:
    Device size: 4000787030016 bytes, 3815447.8 MiB, 4000.79 GB, 4.00 TB


root@pve1:~# sg_vpd -l /dev/sda
Supported VPD pages VPD page:
    [PQual=0  Peripheral device type: disk]
   0x00  Supported VPD pages [sv]
   0x80  Unit serial number [sn]
   0x83  Device identification [di]
   0x87  Mode page policy [mpp]
   0x89  ATA information (SAT) [ai]
   0x8a  Power condition [pc]
   0xb0  Block limits (SBC) [bl]
   0xb1  Block device characteristics (SBC) [bdc]
   0xb2  Logical block provisioning (SBC) [lbpv]


root@pve1:~# sg_vpd -p 0xb0 /dev/sda
Block limits VPD page (SBC):
   Write same non-zero (WSNZ): 0
   Maximum compare and write length: 0 blocks [Command not implemented]
   Optimal transfer length granularity: 0 blocks [not reported]
   Maximum transfer length: 0 blocks [not reported]
   Optimal transfer length: 0 blocks [not reported]
   Maximum prefetch transfer length: 0 blocks [ignored]
   Maximum unmap LBA count: 0 [Unmap command not implemented]
   Maximum unmap block descriptor count: 0 [Unmap command not implemented]
   Optimal unmap granularity: 0 blocks [not reported]
   Unmap granularity alignment valid: false
   Unmap granularity alignment: 0 [invalid]
   Maximum write same length: 0 blocks [not reported]
   Maximum atomic transfer length: 0 blocks [not reported]
   Atomic alignment: 0 [unaligned atomic writes permitted]
   Atomic transfer length granularity: 0 [no granularity requirement
   Maximum atomic transfer length with atomic boundary: 0 blocks [not 
reported]
   Maximum atomic boundary size: 0 blocks [can only write atomic 1 block]


root@pve1:~# sg_vpd -p 0xb1 /dev/sda
Block device characteristics VPD page (SBC):
   Nominal rotation rate: 7200 rpm
   Product type: Not specified
   WABEREQ=0
   WACEREQ=0
   Nominal form factor: 3.5 inch
   ZONED=0
   RBWZ=0
   BOCS=0
   FUAB=0
   VBULS=0
   DEPOPULATION_TIME=0 (seconds)


root@pve1:~# sg_vpd -p 0xb2 /dev/sda
Logical block provisioning VPD page (SBC):
   Unmap command supported (LBPU): 0
   Write same (16) with unmap bit supported (LBPWS): 0
   Write same (10) with unmap bit supported (LBPWS10): 0
   Logical block provisioning read zeros (LBPRZ): 0
   Anchored LBAs supported (ANC_SUP): 0
   Threshold exponent: 0 [threshold sets not supported]
   Descriptor present (DP): 0
   Minimum percentage: 0 [not reported]
   Provisioning type: 0 (not known or fully provisioned)
   Threshold percentage: 0 [percentages not supported]


On  2022-03-02  06:35, Martin K. Petersen wrote:
> We currently default to 255 bytes when fetching VPD pages during discovery.
> However, we have had a few devices that are known to wedge if the requested
> buffer exceeds a certain size. See commit af73623f5f10 ("[SCSI] sd: Reduce
> buffer size for vpd request") which works around one example of this
> problem in the SCSI disk driver.
> 
> With commit d188b0675b21 ("scsi: core: Add sysfs attributes for VPD pages
> 0h and 89h") we now risk triggering the same issue in the generic midlayer
> code.
> 
> The problem with the ATA VPD page in particular is that the SCSI portion of
> the page is trailed by 512 bytes of verbatim ATA Identify Device
> information.  However, not all controllers actually provide the additional
> 512 bytes and will lock up if one asks for more than the 64 bytes
> containing the SCSI protocol fields.
> 
> Instead of picking a new, somewhat arbitrary, number of bytes for the VPD
> buffer size, start fetching the 4-byte header for each page. The header
> contains the size of the page as far as the device is concerned. We can use
> the reported size to specify the correct allocation length when
> subsequently fetching the full page.
> 
> The header validation is done by a new helper function scsi_get_vpd_size()
> and both scsi_get_vpd_page() and scsi_get_vpd_buf() now rely on this to
> query the page size.
> 
> In addition, scsi_get_vpd_page() is simplified to mirror the logic in
> scsi_get_vpd_page(). This involves removing the Supported VPD Pages lookup
> prior to attempting to query a page. There does not appear any evidence,
> even in the oldest SCSI specs, that this step is required. We already rely
> on scsi_get_vpd_page() throughout the stack and this function never
> consulted the Supported VPD Pages. Since this has not caused any problems
> it should be safe to remove the precondition from scsi_get_vpd_page().
> 
> Instrumented runs also revealed that the Supported VPD Pages lookup had
> little effect since the device page index often was larger than the
> supplied buffer size. As a result, inquiries frequently bypassed the index
> check and went through the "If we ran off the end of the buffer, give us
> the benefit of the doubt" code path which assumed the page was present
> despite not being listed. The revised code takes both the page size
> reported by the device as well as the size of the buffer provided by the
> scsi_get_vpd_page() caller into account.
> 
> Fixes: d188b0675b21 ("scsi: core: Add sysfs attributes for VPD pages 0h and 89h")
> Reported-by: Maciej W. Rozycki <macro@orcam.me.uk>
> Tested-by: Maciej W. Rozycki <macro@orcam.me.uk>
> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>

Martin K. Petersen June 5, 2024, 2:51 a.m. UTC | #7

Aaron,

> The target is an HDD behind a MegaRAID controller configured as JBOD.

Running in JBOD mode is probably part of the problem.

> root@pve1:~# sg_vpd -l /dev/sda
> Supported VPD pages VPD page:
>    [PQual=0  Peripheral device type: disk]
>   0x00  Supported VPD pages [sv]
>   0x80  Unit serial number [sn]
>   0x83  Device identification [di]
>   0x87  Mode page policy [mpp]
>   0x89  ATA information (SAT) [ai]
>   0x8a  Power condition [pc]
>   0xb0  Block limits (SBC) [bl]
>   0xb1  Block device characteristics (SBC) [bdc]
>   0xb2  Logical block provisioning (SBC) [lbpv]

I am working on a fix for what's probably a related issue.

I would appreciate if you could do two things:

1. Please send me the output of:

   # sg_opcodes /dev/sda

2. Try building and booting:

   https://git.kernel.org/pub/scm/linux/kernel/git/mkp/scsi.git/log/?h=6.10/scsi-fixes

   and see if that makes a difference in your case.

Thanks!

Martin

Aaron Lauterer June 5, 2024, 7:39 a.m. UTC | #8

Hi Martin,

thanks for getting back to me.

On  2024-06-05  04:51, Martin K. Petersen wrote:
> 
> Aaron,
> 
>> The target is an HDD behind a MegaRAID controller configured as JBOD.
> 
> Running in JBOD mode is probably part of the problem.
> 
>> root@pve1:~# sg_vpd -l /dev/sda
>> Supported VPD pages VPD page:
>>     [PQual=0  Peripheral device type: disk]
>>    0x00  Supported VPD pages [sv]
>>    0x80  Unit serial number [sn]
>>    0x83  Device identification [di]
>>    0x87  Mode page policy [mpp]
>>    0x89  ATA information (SAT) [ai]
>>    0x8a  Power condition [pc]
>>    0xb0  Block limits (SBC) [bl]
>>    0xb1  Block device characteristics (SBC) [bdc]
>>    0xb2  Logical block provisioning (SBC) [lbpv]
> 
> I am working on a fix for what's probably a related issue.
> 
> I would appreciate if you could do two things:
> 
> 1. Please send me the output of:
> 
>     # sg_opcodes /dev/sda

Here you go:
---
root@pve1:~# sg_opcodes /dev/sda
   ATA       HGST HUS726T4TAL  W984
   Peripheral device type: disk
Report supported operation codes: Illegal request, Invalid opcode
---

I get the same result for the other 2 block devices configured on this 
RAID controller. One is another JBOD device (an SSD), the other is a RAID-1.

> 
> 2. Try building and booting:
> 
>     https://git.kernel.org/pub/scm/linux/kernel/git/mkp/scsi.git/log/?h=6.10/scsi-fixes
> 
>     and see if that makes a difference in your case.

I checked out the 6.10/scsi-fixes branch which was on commit

  d53b681 scsi: ufs: mcq: Fix error output and clean up ufshcd_mcq_abort()

to build a new kernel to test.

Unfortunately, I must report that running the `fallocate`[0] command 
still works and does not fail with an unsupported error.


If there is more I can do to help, please let me know.

Best regards,
Aaron

[0]
fallocate --punch-hole --keep-size -l 2M /dev/sda --offset 4M

[02/14] scsi: core: Query VPD size before getting full page

Commit Message

Comments

Patch