diff mbox series

[1/2] block: Read max write zeroes once for __blkdev_issue_write_zeroes()

Message ID 20240815082755.105242-2-john.g.garry@oracle.com (mailing list archive)
State New, archived
Headers show
Series block: Fix __blkdev_issue_write_zeroes() limit handling | expand

Commit Message

John Garry Aug. 15, 2024, 8:27 a.m. UTC
As reported in [0], we may get a hang when formatting a XFS FS on a RAID0
drive.

Commit 73a768d5f955 ("block: factor out a blk_write_zeroes_limit helper")
changed __blkdev_issue_write_zeroes() to read the max write zeroes
value in the loop. This is not safe as max write zeroes may change in
value. Specifically for the case of [0], the value goes to 0, and we get
an infinite loop.

Lift the limit reading out of the loop.

[0] https://lore.kernel.org/linux-xfs/4d31268f-310b-4220-88a2-e191c3932a82@oracle.com/T/#t

Fixes: 73a768d5f955 ("block: factor out a blk_write_zeroes_limit helper")
Signed-off-by: John Garry <john.g.garry@oracle.com>
---
 block/blk-lib.c | 22 +++++++++++++++-------
 1 file changed, 15 insertions(+), 7 deletions(-)

Comments

Christoph Hellwig Aug. 15, 2024, 12:40 p.m. UTC | #1
On Thu, Aug 15, 2024 at 08:27:54AM +0000, John Garry wrote:
> +/*
> + * Pass bio_write_zeroes_limit() return value in @limit, as the return
> + * value may change after a REQ_OP_WRITE_ZEROES is issued.
> + */

I don't think that really helps all that much to explain the issue,
which is about SCSI not having an ahead of time flag that reliably
works for write same support, which makes it clear the limit to 0 on
the first I/O completion.  Maybe you can actually spell this out?

Otherwise looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>
John Garry Aug. 15, 2024, 1:29 p.m. UTC | #2
On 15/08/2024 13:40, Christoph Hellwig wrote:
> On Thu, Aug 15, 2024 at 08:27:54AM +0000, John Garry wrote:
>> +/*
>> + * Pass bio_write_zeroes_limit() return value in @limit, as the return
>> + * value may change after a REQ_OP_WRITE_ZEROES is issued.
>> + */
> I don't think that really helps all that much to explain the issue,
> which is about SCSI not having an ahead of time flag that reliably
> works for write same support, which makes it clear the limit to 0 on
> the first I/O completion.  Maybe you can actually spell this out?

Please just tell me what you would like to see, and I will copy verbatim.

cheers
Christoph Hellwig Aug. 15, 2024, 3:26 p.m. UTC | #3
On Thu, Aug 15, 2024 at 02:29:46PM +0100, John Garry wrote:
> On 15/08/2024 13:40, Christoph Hellwig wrote:
>> On Thu, Aug 15, 2024 at 08:27:54AM +0000, John Garry wrote:
>>> +/*
>>> + * Pass bio_write_zeroes_limit() return value in @limit, as the return
>>> + * value may change after a REQ_OP_WRITE_ZEROES is issued.
>>> + */
>> I don't think that really helps all that much to explain the issue,
>> which is about SCSI not having an ahead of time flag that reliably
>> works for write same support, which makes it clear the limit to 0 on
>> the first I/O completion.  Maybe you can actually spell this out?
>
> Please just tell me what you would like to see, and I will copy verbatim.

Probably just what I just wrote.  Unless Martin can come up with
better language.
John Garry Aug. 15, 2024, 3:38 p.m. UTC | #4
On 15/08/2024 16:26, Christoph Hellwig wrote:
>>>> +/*
>>>> + * Pass bio_write_zeroes_limit() return value in @limit, as the return
>>>> + * value may change after a REQ_OP_WRITE_ZEROES is issued.
>>>> + */
>>> I don't think that really helps all that much to explain the issue,
>>> which is about SCSI not having an ahead of time flag that reliably
>>> works for write same support, which makes it clear the limit to 0 on
>>> the first I/O completion.  Maybe you can actually spell this out?
>> Please just tell me what you would like to see, and I will copy verbatim.
> Probably just what I just wrote.  Unless Martin can come up with
> better language.


/*
  * SCSI does not have a reliable ahead of time flag that reliably
  * works for write same support; instead, the limit is set to zero for
  * the first failed I/O completion. As such, only read the limit prior
  * to submission, as it may later change.
  */

ok?
Martin K. Petersen Aug. 15, 2024, 3:41 p.m. UTC | #5
Christoph,

>>>> +/*
>>>> + * Pass bio_write_zeroes_limit() return value in @limit, as the return
>>>> + * value may change after a REQ_OP_WRITE_ZEROES is issued.
>>>> + */
>>> I don't think that really helps all that much to explain the issue,
>>> which is about SCSI not having an ahead of time flag that reliably
>>> works for write same support, which makes it clear the limit to 0 on
>>> the first I/O completion.  Maybe you can actually spell this out?
>>
>> Please just tell me what you would like to see, and I will copy verbatim.
>
> Probably just what I just wrote.  Unless Martin can come up with
> better language.

There is no reliable way for the SCSI subsystem to determine whether a
device supports a WRITE SAME operation without actually performing a
write to media. As a result, write_zeroes is enabled by default and will
be disabled if a zeroing operation subsequently fails. This means that
this queue limit is likely to change at runtime.

Something like that, perhaps?
diff mbox series

Patch

diff --git a/block/blk-lib.c b/block/blk-lib.c
index 9f735efa6c94..00cee94e5d42 100644
--- a/block/blk-lib.c
+++ b/block/blk-lib.c
@@ -111,13 +111,17 @@  static sector_t bio_write_zeroes_limit(struct block_device *bdev)
 		(UINT_MAX >> SECTOR_SHIFT) & ~bs_mask);
 }
 
+/*
+ * Pass bio_write_zeroes_limit() return value in @limit, as the return
+ * value may change after a REQ_OP_WRITE_ZEROES is issued.
+ */
 static void __blkdev_issue_write_zeroes(struct block_device *bdev,
 		sector_t sector, sector_t nr_sects, gfp_t gfp_mask,
-		struct bio **biop, unsigned flags)
+		struct bio **biop, unsigned flags, sector_t limit)
 {
+
 	while (nr_sects) {
-		unsigned int len = min_t(sector_t, nr_sects,
-				bio_write_zeroes_limit(bdev));
+		unsigned int len = min(nr_sects, limit);
 		struct bio *bio;
 
 		if ((flags & BLKDEV_ZERO_KILLABLE) &&
@@ -141,12 +145,14 @@  static void __blkdev_issue_write_zeroes(struct block_device *bdev,
 static int blkdev_issue_write_zeroes(struct block_device *bdev, sector_t sector,
 		sector_t nr_sects, gfp_t gfp, unsigned flags)
 {
+	sector_t limit = bio_write_zeroes_limit(bdev);
 	struct bio *bio = NULL;
 	struct blk_plug plug;
 	int ret = 0;
 
 	blk_start_plug(&plug);
-	__blkdev_issue_write_zeroes(bdev, sector, nr_sects, gfp, &bio, flags);
+	__blkdev_issue_write_zeroes(bdev, sector, nr_sects, gfp, &bio,
+			flags, limit);
 	if (bio) {
 		if ((flags & BLKDEV_ZERO_KILLABLE) &&
 		    fatal_signal_pending(current)) {
@@ -165,7 +171,7 @@  static int blkdev_issue_write_zeroes(struct block_device *bdev, sector_t sector,
 	 * on an I/O error, in which case we'll turn any error into
 	 * "not supported" here.
 	 */
-	if (ret && !bdev_write_zeroes_sectors(bdev))
+	if (ret && !limit)
 		return -EOPNOTSUPP;
 	return ret;
 }
@@ -265,12 +271,14 @@  int __blkdev_issue_zeroout(struct block_device *bdev, sector_t sector,
 		sector_t nr_sects, gfp_t gfp_mask, struct bio **biop,
 		unsigned flags)
 {
+	sector_t limit = bio_write_zeroes_limit(bdev);
+
 	if (bdev_read_only(bdev))
 		return -EPERM;
 
-	if (bdev_write_zeroes_sectors(bdev)) {
+	if (limit) {
 		__blkdev_issue_write_zeroes(bdev, sector, nr_sects,
-				gfp_mask, biop, flags);
+				gfp_mask, biop, flags, limit);
 	} else {
 		if (flags & BLKDEV_ZERO_NOFALLBACK)
 			return -EOPNOTSUPP;