diff mbox

block: correctly fallback for zeroout

Message ID 20160526180813.GA49039@shli-mbp.local (mailing list archive)
State New, archived
Headers show

Commit Message

Shaohua Li May 26, 2016, 6:08 p.m. UTC
blkdev_issue_zeroout try discard/writesame first, if they fail, zeroout
fallback to regular write. The problem is discard/writesame doesn't
return error for -EOPNOTSUPP, then zeroout can't do fallback and leave
disk data not changed. zeroout should have guaranteed zero-fill
behavior.

BTW, I saw several callers of blkdev_issue_discard can handle
-EOPNOTSUPP, not sure why blkdev_issue_discard not returns -EOPNOTSUPP.
The same story for blkdev_issue_write_same.

https://bugzilla.kernel.org/show_bug.cgi?id=118581

Cc: Sitsofe Wheeler <sitsofe@yahoo.com>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: Jens Axboe <axboe@fb.com>
Cc: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Shaohua Li <shli@fb.com>
---
 block/blk-lib.c | 35 ++++++++++++++++++++++++++---------
 1 file changed, 26 insertions(+), 9 deletions(-)

Comments

Christoph Hellwig May 29, 2016, 6:47 a.m. UTC | #1
On Thu, May 26, 2016 at 11:08:14AM -0700, Shaohua Li wrote:
> -int blkdev_issue_discard(struct block_device *bdev, sector_t sector,
> -		sector_t nr_sects, gfp_t gfp_mask, unsigned long flags)
> +static int do_blkdev_issue_discard(struct block_device *bdev, sector_t sector,

We've split blkdev_issue_discard into __blkdev_issue_discard and a small
wrapper around in for 4.7, so this will need a bit of an update.

As part of that I also removed the strange EOPNOTSUPP ignore, but Mike
reverted it just because it changed something in the dm testsuite.

I still believe we should never ignore it in this helper, and only
do so in callers that believe it's the right thing.
--
To unsubscribe from this list: send the line "unsubscribe linux-block" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Martin K. Petersen June 3, 2016, 2:56 a.m. UTC | #2
>>>>> "Sitsofe" == Sitsofe Wheeler <sitsofe@yahoo.com> writes:

Sitsofe> The original SCSI WRITE SAME has overloaded semantics - not
Sitsofe> only does it mean "write this data multiple times" but it can
Sitsofe> also be used to mean "discard this range" too. If the kernel's
Sitsofe> command was modelled on the SCSI original perhaps this
Sitsofe> conflation clouded things?

REQ_WRITE_SAME in the context of the kernel explicitly means "write
payload to this block range".

A REQ_DISCARD command may be serviced using WRITE SAME(16) with the
UNMAP bit set in the SCSI disk driver but that's entirely orthogonal.
Martin K. Petersen June 3, 2016, 3:06 a.m. UTC | #3
>>>>> "Christoph" == Christoph Hellwig <hch@infradead.org> writes:

Christoph> As part of that I also removed the strange EOPNOTSUPP ignore,
Christoph> but Mike reverted it just because it changed something in the
Christoph> dm testsuite.

Mike?

Christoph> I still believe we should never ignore it in this helper, and
Christoph> only do so in callers that believe it's the right thing.

Yeah.

I really wish EOPNOTSUPP would just go away except for ioctl callers.
Now that we have real bi_error I don't understand why we need it.
Mike Snitzer June 3, 2016, 3:54 a.m. UTC | #4
On Thu, Jun 02 2016 at 11:06pm -0400,
Martin K. Petersen <martin.petersen@oracle.com> wrote:

> >>>>> "Christoph" == Christoph Hellwig <hch@infradead.org> writes:
> 
> Christoph> As part of that I also removed the strange EOPNOTSUPP ignore,
> Christoph> but Mike reverted it just because it changed something in the
> Christoph> dm testsuite.
> 
> Mike?

Yes? ;)

Seems there is some serious confusion going on here.  The entirety of
hch's post (to which you quoted a subset) makes little sense to me.

shli's patch builds ontop of latest blk-lib.c code yet hch said this::
"We've split blkdev_issue_discard into __blkdev_issue_discard and a
small wrapper around in for 4.7, so this will need a bit of an update."

And hch never "removed the strange EOPNOTSUPP ignore".  He preserved it
(see his commit 38f25255330's "return ret != -EOPNOTSUPP ? ret : 0;"
that I adjusted in commit bbd848e0f -- _and_ he expanded it to eat the
early return that I restored).

So I can only infer that hch is still missing why my revert fixes
historic blkdev_issue_discard() behavior that his commit regressed.
Please read commit bbd848e0f's header.  That at least details the early
vs late -EOPNOTSUPP blkdev_issue_discard() return.

But all that nuance aside, AFAICT my commit bbd848e0f ("block: reinstate
early return of -EOPNOTSUPP from blkdev_issue_discard") really has
_nothing_ to do with the issue shli is addressing with his fix.

> Christoph> I still believe we should never ignore it in this helper, and
> Christoph> only do so in callers that believe it's the right thing.
> 
> Yeah.

Hmm...

You agreed to what hch said there about how we should probably always
return EOPNOTSUPP but then you immediately elaborated with details that
mean you don't agree:
 
> I really wish EOPNOTSUPP would just go away except for ioctl callers.
> Now that we have real bi_error I don't understand why we need it.

But hch was originally in favor of _always_ dropping EOPNOTSUPP on the
floor (that is what his commit 38f25255330 did).  Then he said he
disagrees with these interfaces playing games with masking EOPNOTSUPP --
to which you seemingly really don't agree.  Unless I'm completely
misreading you.

Anyway, shli is at least making it so that blkdev_issue_zerout() can
fallback to other mechanisms as needed.
--
To unsubscribe from this list: send the line "unsubscribe linux-block" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Martin K. Petersen June 7, 2016, 2:32 a.m. UTC | #5
>>>>> "Mike" == Mike Snitzer <snitzer@redhat.com> writes:

Mike> But hch was originally in favor of _always_ dropping EOPNOTSUPP on
Mike> the floor (that is what his commit 38f25255330 did).  Then he said
Mike> he disagrees with these interfaces playing games with masking
Mike> EOPNOTSUPP -- to which you seemingly really don't agree.  Unless
Mike> I'm completely misreading you.

Userland apps rely on EOPNOTSUPP, we can't break that.

What I don't like this is "soft" error special casing of EOPNOTSUPP in
the actual implementation of discard, write same, etc. These functions
should return either success or failure. And the ioctl wrapper should
then decide whether to return EOPNOTSUPP, EIO or EPONIES.

I.e. separate the policy from the implementation. This would also solve
some of the grievances for the target folks.
Christoph Hellwig June 7, 2016, 6:38 a.m. UTC | #6
On Mon, Jun 06, 2016 at 10:32:38PM -0400, Martin K. Petersen wrote:
> >>>>> "Mike" == Mike Snitzer <snitzer@redhat.com> writes:
> 
> Mike> But hch was originally in favor of _always_ dropping EOPNOTSUPP on
> Mike> the floor (that is what his commit 38f25255330 did).  Then he said
> Mike> he disagrees with these interfaces playing games with masking
> Mike> EOPNOTSUPP -- to which you seemingly really don't agree.  Unless
> Mike> I'm completely misreading you.
> 
> Userland apps rely on EOPNOTSUPP, we can't break that.

Rely on what exactly?  Current we return EOPNOTSUPP if the device
doesn't claim to support discards, but it returns 0 if the device first
claims to support it but then fails the I/O.
--
To unsubscribe from this list: send the line "unsubscribe linux-block" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Martin K. Petersen June 10, 2016, 2:05 a.m. UTC | #7
>>>>> "Christoph" == Christoph Hellwig <hch@infradead.org> writes:

>> Userland apps rely on EOPNOTSUPP, we can't break that.

Christoph> Rely on what exactly?  Current we return EOPNOTSUPP if the
Christoph> device doesn't claim to support discards, but it returns 0 if
Christoph> the device first claims to support it but then fails the I/O.

Hopefully we can clean up this when/if we go the fallocate() route.
diff mbox

Patch

diff --git a/block/blk-lib.c b/block/blk-lib.c
index 23d7f30..232f9ea 100644
--- a/block/blk-lib.c
+++ b/block/blk-lib.c
@@ -95,8 +95,9 @@  EXPORT_SYMBOL(__blkdev_issue_discard);
  * Description:
  *    Issue a discard request for the sectors in question.
  */
-int blkdev_issue_discard(struct block_device *bdev, sector_t sector,
-		sector_t nr_sects, gfp_t gfp_mask, unsigned long flags)
+static int do_blkdev_issue_discard(struct block_device *bdev, sector_t sector,
+	sector_t nr_sects, gfp_t gfp_mask, unsigned long flags,
+	bool ignore_nosupport)
 {
 	int type = REQ_WRITE | REQ_DISCARD;
 	struct bio *bio = NULL;
@@ -111,13 +112,20 @@  int blkdev_issue_discard(struct block_device *bdev, sector_t sector,
 			&bio);
 	if (!ret && bio) {
 		ret = submit_bio_wait(type, bio);
-		if (ret == -EOPNOTSUPP)
+		if (ignore_nosupport && ret == -EOPNOTSUPP)
 			ret = 0;
 	}
 	blk_finish_plug(&plug);
 
 	return ret;
 }
+
+int blkdev_issue_discard(struct block_device *bdev, sector_t sector,
+		sector_t nr_sects, gfp_t gfp_mask, unsigned long flags)
+{
+	return do_blkdev_issue_discard(bdev, sector, nr_sects, gfp_mask,
+			flags, true);
+}
 EXPORT_SYMBOL(blkdev_issue_discard);
 
 /**
@@ -131,9 +139,9 @@  EXPORT_SYMBOL(blkdev_issue_discard);
  * Description:
  *    Issue a write same request for the sectors in question.
  */
-int blkdev_issue_write_same(struct block_device *bdev, sector_t sector,
+static int do_blkdev_issue_write_same(struct block_device *bdev, sector_t sector,
 			    sector_t nr_sects, gfp_t gfp_mask,
-			    struct page *page)
+			    struct page *page, bool ignore_nosupport)
 {
 	struct request_queue *q = bdev_get_queue(bdev);
 	unsigned int max_write_same_sectors;
@@ -167,7 +175,15 @@  int blkdev_issue_write_same(struct block_device *bdev, sector_t sector,
 
 	if (bio)
 		ret = submit_bio_wait(REQ_WRITE | REQ_WRITE_SAME, bio);
-	return ret != -EOPNOTSUPP ? ret : 0;
+	return (ret != -EOPNOTSUPP || !ignore_nosupport) ? ret : 0;
+}
+
+int blkdev_issue_write_same(struct block_device *bdev, sector_t sector,
+			    sector_t nr_sects, gfp_t gfp_mask,
+			    struct page *page)
+{
+	return do_blkdev_issue_write_same(bdev, sector, nr_sects, gfp_mask,
+		page, true);
 }
 EXPORT_SYMBOL(blkdev_issue_write_same);
 
@@ -238,12 +254,13 @@  int blkdev_issue_zeroout(struct block_device *bdev, sector_t sector,
 	struct request_queue *q = bdev_get_queue(bdev);
 
 	if (discard && blk_queue_discard(q) && q->limits.discard_zeroes_data &&
-	    blkdev_issue_discard(bdev, sector, nr_sects, gfp_mask, 0) == 0)
+	    do_blkdev_issue_discard(bdev, sector, nr_sects, gfp_mask, 0,
+					false) == 0)
 		return 0;
 
 	if (bdev_write_same(bdev) &&
-	    blkdev_issue_write_same(bdev, sector, nr_sects, gfp_mask,
-				    ZERO_PAGE(0)) == 0)
+	    do_blkdev_issue_write_same(bdev, sector, nr_sects, gfp_mask,
+				    ZERO_PAGE(0), false) == 0)
 		return 0;
 
 	return __blkdev_issue_zeroout(bdev, sector, nr_sects, gfp_mask);