diff mbox

block: reintroduce discard_zeroes_data sysfs file and BLKDISCARDZEROES

Message ID 1502889581-19483-1-git-send-email-lczerner@redhat.com (mailing list archive)
State New, archived
Headers show

Commit Message

Lukas Czerner Aug. 16, 2017, 1:19 p.m. UTC
Discard and zeroout code has been significantly rewritten recently and
as a part of the rewrite we got rid of the discard_zeroes_data flag.

With commit 48920ff2a5a9 ("block: remove the discard_zeroes_data flag")
the discard_zeroes_data sysfs file and the BLKDISCARDZEROES ioctl now
always return zero, regardless of what the device actually supports.
This has broken userspace utilities, which will no longer take advantage
of this functionality even if the device actually supports it.

Now the only way for a user to figure out whether the device supports
deterministic read zeroes after discard, without actually running
fallocate, is to check for discard support (discard_max_bytes) and for a
zeroout hw offload (write_zeroes_max_bytes).

However we still have the discard_zeroes_data sysfs file and the
BLKDISCARDZEROES ioctl, so I do not see any reason not to do this check
in the kernel and provide a convenient and compatible way to continue to
export this information to user space.

With this patch both the BLKDISCARDZEROES ioctl and discard_zeroes_data
will return 1 in the case that discard and a hw offload for write zeroes
are supported. Otherwise they will return 0.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
---
 Documentation/ABI/testing/sysfs-block | 11 +++++++++--
 Documentation/block/queue-sysfs.txt   |  5 +++++
 block/blk-sysfs.c                     |  5 ++++-
 block/ioctl.c                         |  6 +++++-
 4 files changed, 23 insertions(+), 4 deletions(-)

Comments

Christoph Hellwig Aug. 16, 2017, 3:18 p.m. UTC | #1
The patch is incorrect - just because we support zeroing using
write zeroes doesn't mean that any discard will zero the data.

This subtle difference was one of the main points for the changes.
Lukas Czerner Aug. 16, 2017, 3:48 p.m. UTC | #2
On Wed, Aug 16, 2017 at 05:18:03PM +0200, Christoph Hellwig wrote:
> The patch is incorrect - just because we support zeroing using
> write zeroes doesn't mean that any discard will zero the data.
> 
> This subtle difference was one of the main points for the changes.

Ok, so the change breaks user space code. Do we have any documentation
of the right way to do things and what does what?

I can see that you've made blkdev_fallocate() use discard and zeroout
with various flags. There is no documentation on what guarantees we have
when we use a certain combination of flags.

The comment for __blkdev_issue_zeroout() says

If a device is using logical block provisioning, the underlying
space will not be released if %flags contains BLKDEV_ZERO_NOUNMAP.

So does it mean that this only applies to SCSI UNMAP, not to TRIM with
reliable RZAT?

I'd like to be able to recognize when we have a device that does
support write zeroes with unmap, TRIM with RZAT, or whatever else
provides this.

-Lukas
Martin K. Petersen Aug. 17, 2017, 1:49 a.m. UTC | #3
Lukas,

> I'd like to be able to recognize when we have a device that does
> support write zeroes with unmap, TRIM with RZAT, or whatever else
> provides this.

The problem was that the original REQ_DISCARD was used both for
de-provisioning block ranges and for zeroing them. On top of that, the
storage standards tweaked the definitions a bit so the semantics became
even more confusing and harder to honor in the drivers.

As a result, we changed things so that discards are only used to
de-provision blocks. And the zeroout call/ioctl is used to zero block
ranges.

Which ATA/SCSI/NVMe command is issued on the back-end depends on what's
supported by the device and is hidden from the caller.

However, zeroout is guaranteed to return a zeroed block range on
subsequent reads. The blocks may be unmapped, anchored, written
explicitly, written with write same, or a combination thereof. But you
are guaranteed predictable results.

Whereas a discarded region may be sliced and diced and rounded off
before it hits the device. Which is then free to ignore all or parts of
the request.

Consequently, discard_zeroes_data is meaningless. Because there is no
guarantee that all of the discarded blocks will be acted upon. It
kinda-sorta sometimes worked (if the device was whitelisted, had a
reported alignment of 0, a granularity of 512 bytes, stacking didn't get
in the way, and you were lucky on the device end). But there were always
conditions.

So taking a step back: What information specifically were you trying to
obtain from querying that flag? And why do you need it?
Lukas Czerner Aug. 17, 2017, 7:47 a.m. UTC | #4
On Wed, Aug 16, 2017 at 09:49:22PM -0400, Martin K. Petersen wrote:
> storage standards tweaked the definitions a bit so the semantics became
> even more confusing and harder to honor in the drivers.
> 
> As a result, we changed things so that discards are only used to
> de-provision blocks. And the zeroout call/ioctl is used to zero block
> ranges.
> 
> Which ATA/SCSI/NVMe command is issued on the back-end depends on what's
> supported by the device and is hidden from the caller.
> 
> However, zeroout is guaranteed to return a zeroed block range on
> subsequent reads. The blocks may be unmapped, anchored, written
> explicitly, written with write same, or a combination thereof. But you
> are guaranteed predictable results.
> 
> Whereas a discarded region may be sliced and diced and rounded off
> before it hits the device. Which is then free to ignore all or parts of
> the request.
> 
> Consequently, discard_zeroes_data is meaningless. Because there is no
> guarantee that all of the discarded blocks will be acted upon. It
> kinda-sorta sometimes worked (if the device was whitelisted, had a
> reported alignment of 0, a granularity of 512 bytes, stacking didn't get
> in the way, and you were lucky on the device end). But there were always
> conditions.

Thanks for the detailed explanation. That's very useful to know!

> 
> So taking a step back: What information specifically were you trying to
> obtain from querying that flag? And why do you need it?

There are many users that historically benefit from the
"discard_zeroes_data" semantics. For example mkfs, where it's beneficial
to discard the blocks before creating a file system; if we also get
deterministic zeroes on read, even better, since we do not have to
initialize some portions of the file system manually.

The other example might be virtualization, which can support efficient
"Wipe After Delete" and "Enable Discard" in the case that
"discard_zeroes_data" is set. I am sure there are other examples.

So I understand now that Deterministic Read Zero after TRIM is not
reliable, so we do not want to use that flag because we can't guarantee
it in this case. However, there are other situations where we can, such
as a loop device (which might be especially useful for VMs) where the
backing file system supports punch hole, or even SCSI WRITE SAME with
UNMAP?

Currently user space can call fallocate with FALLOC_FL_PUNCH_HOLE |
FALLOC_FL_KEEP_SIZE, however if that succeeds we're only guaranteed that
the range has been zeroed, not unmapped/discarded? (That's not very
clear from the comments.) None of the modes seems to guarantee both
zeroout and unmap in case of success. Still, there seems to be no
way to tell what's actually supported from user space without calling
fallocate, is there? Whereas before we had discard_zeroes_data, which
people learned to rely on in certain situations, even though it might
have been shaky.

I actually like the rewrite Christoph did, even though the
documentation seems to be lacking. But I just wonder if it's possible to
bring back the former functionality, at least in some form.

Thanks!
-Lukas


> 
> -- 
> Martin K. Petersen	Oracle Linux Engineering
Christoph Hellwig Aug. 17, 2017, 8:17 a.m. UTC | #5
On Thu, Aug 17, 2017 at 09:47:44AM +0200, Lukas Czerner wrote:
> There are many users that historically benefit from the
> "discard_zeroes_data" semantics. For example mkfs, where it's beneficial
> to discard the blocks before creating a file system; if we also get
> deterministic zeroes on read, even better, since we do not have to
> initialize some portions of the file system manually.

But that's not what discard_zeroes_data gives you, unfortunately.

> So I understand now that Deterministic Read Zero after TRIM is not
> reliable, so we do not want to use that flag because we can't guarantee
> it in this case. However, there are other situations where we can, such
> as a loop device (which might be especially useful for VMs) where the
> backing file system supports punch hole, or even SCSI WRITE SAME with
> UNMAP?
> 
> Currently user space can call fallocate with FALLOC_FL_PUNCH_HOLE |
> FALLOC_FL_KEEP_SIZE, however if that succeeds we're only guaranteed that
> the range has been zeroed, not unmapped/discarded? (That's not very
> clear from the comments.) None of the modes seems to guarantee both
> zeroout and unmap in case of success. Still, there seems to be no
> way to tell what's actually supported from user space without calling
> fallocate, is there? Whereas before we had discard_zeroes_data, which
> people learned to rely on in certain situations, even though it might
> have been shaky.

You never get (and never got) a guarantee that the blocks were unmapped,
as none of the storage protocols ever requires the device to deallocate.
Because devices have their internal chunk/block sizes, anything else
would be impractical.

But fallocate FALLOC_FL_PUNCH_HOLE on a block device sets the REQ_UNMAP
hint, which asks the driver to unmap if at all possible.  Note that
unmap or not is not a binary decision - typical devices will deallocate
all whole blocks inside the range, and zero the rest.
Lukas Czerner Aug. 17, 2017, 8:41 a.m. UTC | #6
On Thu, Aug 17, 2017 at 10:17:11AM +0200, Christoph Hellwig wrote:
> On Thu, Aug 17, 2017 at 09:47:44AM +0200, Lukas Czerner wrote:
> > There are many users that historically benefit from the
> > "discard_zeroes_data" semantics. For example mkfs, where it's beneficial
> > to discard the blocks before creating a file system; if we also get
> > deterministic zeroes on read, even better, since we do not have to
> > initialize some portions of the file system manually.
> 
> But that's not what discard_zeroes_data gives you, unfortunately.
> 
> > So I understand now that Deterministic Read Zero after TRIM is not
> > reliable, so we do not want to use that flag because we can't guarantee
> > it in this case. However, there are other situations where we can, such
> > as a loop device (which might be especially useful for VMs) where the
> > backing file system supports punch hole, or even SCSI WRITE SAME with
> > UNMAP?
> > 
> > Currently user space can call fallocate with FALLOC_FL_PUNCH_HOLE |
> > FALLOC_FL_KEEP_SIZE, however if that succeeds we're only guaranteed that
> > the range has been zeroed, not unmapped/discarded? (That's not very
> > clear from the comments.) None of the modes seems to guarantee both
> > zeroout and unmap in case of success. Still, there seems to be no
> > way to tell what's actually supported from user space without calling
> > fallocate, is there? Whereas before we had discard_zeroes_data, which
> > people learned to rely on in certain situations, even though it might
> > have been shaky.
> 
> You never get (and never got) a guarantee that the blocks were unmapped,
> as none of the storage protocols ever requires the device to deallocate.
> Because devices have their internal chunk/block sizes, anything else
> would be impractical.

Ok, "guarantee" in this case is probably too strong an expression. But
you know what I mean: saying "hey, I am not using those blocks, feel
free to release them if you can". I did not see the need to be so
strict about discard; after all, it's for the device's benefit.

However, there are devices that do both (discard and zeroout) on a
discard request, and that's what I am after: being able to identify
that case and take advantage of it. I think it's possible now with
fallocate FALLOC_FL_PUNCH_HOLE, but I do not know anything in advance,
and I can't tell whether the device attempted to unmap even if it did.

-Lukas
Christoph Hellwig Aug. 17, 2017, 9:52 a.m. UTC | #7
On Thu, Aug 17, 2017 at 10:41:43AM +0200, Lukas Czerner wrote:
> Ok, "guarantee" in this case is probably too strong an expression. But
> you know what I mean: saying "hey, I am not using those blocks, feel
> free to release them if you can". I did not see the need to be so
> strict about discard; after all, it's for the device's benefit.

That's what the BLKDISCARD ioctl, or fallocate with the
FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE | FALLOC_FL_NO_HIDE_STALE
flags, will do for you.

> However, there are devices that do both (discard and zeroout) on a
> discard request, and that's what I am after: being able to identify
> that case and take advantage of it. I think it's possible now with
> fallocate FALLOC_FL_PUNCH_HOLE, but I do not know anything in advance,
> and I can't tell whether the device attempted to unmap even if it did.

With FALLOC_FL_PUNCH_HOLE you know that it is device offloaded.
We don't know that the device won't zero part or all of it, but
then again that was possible before when using BLKDISCARD as well.

Now after all that theory: is there a practical issue behind all
this?
Lukas Czerner Aug. 17, 2017, 1:35 p.m. UTC | #8
On Thu, Aug 17, 2017 at 11:52:42AM +0200, Christoph Hellwig wrote:
> On Thu, Aug 17, 2017 at 10:41:43AM +0200, Lukas Czerner wrote:
> > Ok, "guarantee" in this case is probably too strong an expression. But
> > you know what I mean: saying "hey, I am not using those blocks, feel
> > free to release them if you can". I did not see the need to be so
> > strict about discard; after all, it's for the device's benefit.
> 
> That's what the BLKDISCARD ioctl, or fallocate with the
> FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE | FALLOC_FL_NO_HIDE_STALE
> flags, will do for you.

Sure, that's not the problem here.

> 
> > However, there are devices that do both (discard and zeroout) on a
> > discard request, and that's what I am after: being able to identify
> > that case and take advantage of it. I think it's possible now with
> > fallocate FALLOC_FL_PUNCH_HOLE, but I do not know anything in advance,
> > and I can't tell whether the device attempted to unmap even if it did.
> 
> With FALLOC_FL_PUNCH_HOLE you know that it is device offloaded.
> We don't know that the device won't zero part or all of it, but
> then again that was possible before when using BLKDISCARD as well.

Do we also know that the blocks were discarded, as we do with
BLKDISCARD?

> 
> Now after all that theory: is there a practical issue behind all
> this?

As I mentioned before, we relied on discard_zeroes_data in mkfs.ext4 to
make sure that inode tables are zeroed after discard. Also, RHV allows
users to set "Wipe After Delete" and "Enable Discard" on the VM disk if
it supported discard_zeroes_data. I am trying to find a way to replace
this functionality now that discard_zeroes_data always returns 0.

-Lukas
Martin K. Petersen Aug. 17, 2017, 5:47 p.m. UTC | #9
Lukas,

> Do we also know that the blocks were discarded as we do with
> BLKDISCARD ?

There never was a way to know for sure.

ATA DSM TRIM and SCSI UNMAP are hints by definition. We attempted to
bend their semantics towards getting predictable behavior but ultimately
failed. Too many corner cases.

> As I mentioned before, we relied on discard_zeroes_data in mkfs.ext4
> to make sure that inode tables are zeroed after discard.

The point is that you shouldn't have an if (discard_zeroes_data)
conditional in the first place.

 - If you need to deallocate a block range and you don't care about its
   contents in the future, use BLKDISCARD / FL_PUNCH_HOLE.

 - If you need to zero a block range, use BLKZEROOUT / FL_ZERO_RANGE.

So the mkfs usage model should essentially be:

 - DISCARD the entire partition/device block range

 - ZEROOUT the inode tables and other areas where you need zeroes on
   future reads

And that should be the approach regardless of whether your device is a
disk drive, an SSD, an NVMe device, a SCSI array, or whatever. DISCARD old
contents, ZEROOUT the pieces you care about. No conditionals or trying
to do things differently based on device capabilities.
Lukas Czerner Aug. 17, 2017, 7:35 p.m. UTC | #10
On Thu, Aug 17, 2017 at 01:47:13PM -0400, Martin K. Petersen wrote:
> > As I mentioned before, we relied on discard_zeroes_data in mkfs.ext4
> > to make sure that inode tables are zeroed after discard.
> 
> The point is that you shouldn't have an if (discard_zeroes_data)
> conditional in the first place.
> 
>  - If you need to deallocate a block range and you don't care about its
>    contents in the future, use BLKDISCARD / FL_PUNCH_HOLE.
> 
>  - If you need to zero a block range, use BLKZEROOUT / FL_ZERO_RANGE.
> 
> So the mkfs usage model should essentially be:
> 
>  - DISCARD the entire partition/device block range
> 
>  - ZEROOUT the inode tables and other areas where you need zeroes on
>    future reads

Yeah, that's what needs to be done in general, although as I said there
are devices where, after DISCARD, we will read zeroes. The comment in
__blkdev_issue_zeroout() even says it will unmap. In that case the
discard or the zeroout will be redundant, and that can take quite some
time on bigger devices.

-Lukas

> 
> And that should be the approach regardless of whether your device is a
> disk drive, an SSD, an NVMe device, a SCSI array, or whatever. DISCARD old
> contents, ZEROOUT the pieces you care about. No conditionals or trying
> to do things differently based on device capabilities.
> 
> -- 
> Martin K. Petersen	Oracle Linux Engineering
Theodore Ts'o Aug. 17, 2017, 8:39 p.m. UTC | #11
On Thu, Aug 17, 2017 at 01:47:13PM -0400, Martin K. Petersen wrote:
> 
> Lukas,
> 
> > Do we also know that the blocks were discarded as we do with
> > BLKDISCARD ?
> 
> There never was a way to know for sure.
> 
> ATA DSM TRIM and SCSI UNMAP are hints by definition. We attempted to
> bend their semantics towards getting predictable behavior but ultimately
> failed. Too many corner cases.

I believe that eMMC and UFS do give you a way to get stronger
guarantees.

It's unfortunate that SATA did not get this right; it's even more
unfortunate that for many drives it works 99% of the time.  But yes,
it's not a guarantee.

						- Ted

Patch

diff --git a/Documentation/ABI/testing/sysfs-block b/Documentation/ABI/testing/sysfs-block
index dea212d..6ea0d03 100644
--- a/Documentation/ABI/testing/sysfs-block
+++ b/Documentation/ABI/testing/sysfs-block
@@ -213,8 +213,15 @@  What:		/sys/block/<disk>/queue/discard_zeroes_data
 Date:		May 2011
 Contact:	Martin K. Petersen <martin.petersen@oracle.com>
 Description:
-		Will always return 0.  Don't rely on any specific behavior
-		for discards, and don't read this file.
+		Devices that support discard functionality may return
+		stale or random data when a previously discarded block
+		is read back. This can cause problems if the filesystem
+		expects discarded blocks to be explicitly cleared. If a
+		device reports that it deterministically returns zeroes
+		when a discarded area is read, the discard_zeroes_data
+		parameter will be set to 1. Otherwise it will be 0 and
+		the result of reading a discarded area is undefined.
+
 
 What:		/sys/block/<disk>/queue/write_same_max_bytes
 Date:		January 2012
diff --git a/Documentation/block/queue-sysfs.txt b/Documentation/block/queue-sysfs.txt
index 2c1e670..b7f6bdc 100644
--- a/Documentation/block/queue-sysfs.txt
+++ b/Documentation/block/queue-sysfs.txt
@@ -43,6 +43,11 @@  large discards are issued, setting this value lower will make Linux issue
 smaller discards and potentially help reduce latencies induced by large
 discard operations.
 
+discard_zeroes_data (RO)
+------------------------
+When read, this file shows whether discarded blocks are zeroed by the
+device or not. If its value is '1' the blocks are zeroed, otherwise not.
+
 hw_sector_size (RO)
 -------------------
 This is the hardware sector size of the device, in bytes.
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index 27aceab..5b41ad0 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -209,7 +209,10 @@  static ssize_t queue_discard_max_store(struct request_queue *q,
 
 static ssize_t queue_discard_zeroes_data_show(struct request_queue *q, char *page)
 {
-	return queue_var_show(0, page);
+	if (blk_queue_discard(q) && q->limits.max_write_zeroes_sectors)
+		return queue_var_show(1, page);
+	else
+		return queue_var_show(0, page);
 }
 
 static ssize_t queue_write_same_max_show(struct request_queue *q, char *page)
diff --git a/block/ioctl.c b/block/ioctl.c
index 0de02ee..faecd44 100644
--- a/block/ioctl.c
+++ b/block/ioctl.c
@@ -508,6 +508,7 @@  int blkdev_ioctl(struct block_device *bdev, fmode_t mode, unsigned cmd,
 	void __user *argp = (void __user *)arg;
 	loff_t size;
 	unsigned int max_sectors;
+	struct request_queue *q = bdev_get_queue(bdev);
 
 	switch (cmd) {
 	case BLKFLSBUF:
@@ -547,7 +548,10 @@  int blkdev_ioctl(struct block_device *bdev, fmode_t mode, unsigned cmd,
 	case BLKALIGNOFF:
 		return put_int(arg, bdev_alignment_offset(bdev));
 	case BLKDISCARDZEROES:
-		return put_uint(arg, 0);
+		if (blk_queue_discard(q) && q->limits.max_write_zeroes_sectors)
+			return put_uint(arg, 1);
+		else
+			return put_uint(arg, 0);
 	case BLKSECTGET:
 		max_sectors = min_t(unsigned int, USHRT_MAX,
 				    queue_max_sectors(bdev_get_queue(bdev)));