[00/17] Improve qcow2 all-zero detection

Message ID 20200131174436.2961874-1-eblake@redhat.com (mailing list archive)

Message

Eric Blake Jan. 31, 2020, 5:44 p.m. UTC
Based-on: <20200124103458.1525982-2-david.edmondson@oracle.com>
([PATCH v2 1/2] qemu-img: Add --target-is-zero to convert)

I'm working on adding an NBD extension that reports whether an image
is already all zero when the client first connects.  I initially
thought I could write the NBD code to just call bdrv_has_zero_init(),
but that turned out to be a bad assumption that instead resulted in
this patch series.  The NBD patch will come later (and cross-posted to
the NBD protocol, libnbd, nbdkit, and qemu, as it will affect all four
repositories).

I do have an RFC question on patch 13 - as implemented here, I set a
qcow2 bit if all clusters in the image are known to read as zero and
there is no backing image.  But it may be more useful to instead report
whether all clusters _allocated in this layer_ are zero, at which point
the overall image is all-zero only if the backing file also has that
property (or even make it two bits).  The tweaks to subsequent patches,
based on whatever we decide makes the most useful semantics, shouldn't
be hard.
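
For illustration only, here is a minimal sketch of what such a bit could
look like, modeled on the existing autoclear bit definitions in
block/qcow2.h (the name and bit number below are hypothetical, not
necessarily what patch 13 uses):

enum {
    QCOW2_AUTOCLEAR_BITMAPS_BITNR       = 0,
    QCOW2_AUTOCLEAR_DATA_FILE_RAW_BITNR = 1,
    QCOW2_AUTOCLEAR_ALL_ZERO_BITNR      = 2, /* hypothetical new bit */

    QCOW2_AUTOCLEAR_BITMAPS       = 1 << QCOW2_AUTOCLEAR_BITMAPS_BITNR,
    QCOW2_AUTOCLEAR_DATA_FILE_RAW = 1 << QCOW2_AUTOCLEAR_DATA_FILE_RAW_BITNR,
    QCOW2_AUTOCLEAR_ALL_ZERO      = 1 << QCOW2_AUTOCLEAR_ALL_ZERO_BITNR,
};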

[repo.or.cz appears to be down as I type this; I'll post a link to a
repository later when it comes back up]

Eric Blake (17):
  qcow2: Comment typo fixes
  qcow2: List autoclear bit names in header
  qcow2: Avoid feature name extension on small cluster size
  block: Improve documentation of .bdrv_has_zero_init
  block: Don't advertise zero_init_truncate with encryption
  block: Improve bdrv_has_zero_init_truncate with backing file
  gluster: Drop useless has_zero_init callback
  sheepdog: Consistently set bdrv_has_zero_init_truncate
  block: Refactor bdrv_has_zero_init{,_truncate}
  block: Add new BDRV_ZERO_OPEN flag
  file-posix: Support BDRV_ZERO_OPEN
  gluster: Support BDRV_ZERO_OPEN
  qcow2: Add new autoclear feature for all zero image
  qcow2: Expose all zero bit through .bdrv_known_zeroes
  qcow2: Implement all-zero autoclear bit
  iotests: Add new test for qcow2 all-zero bit
  qcow2: Let qemu-img check cover all-zero bit

 block.c                    |  62 +++++----
 block/file-posix.c         |  16 ++-
 block/file-win32.c         |   3 +-
 block/gluster.c            |  34 +++--
 block/nfs.c                |   7 +-
 block/parallels.c          |   4 +-
 block/qcow.c               |   2 +-
 block/qcow2-refcount.c     |  60 +++++++-
 block/qcow2-snapshot.c     |  11 ++
 block/qcow2.c              | 150 +++++++++++++++++---
 block/qcow2.h              |   6 +-
 block/qed.c                |   3 +-
 block/raw-format.c         |  12 +-
 block/rbd.c                |   3 +-
 block/sheepdog.c           |   7 +-
 block/ssh.c                |   7 +-
 block/vdi.c                |   8 +-
 block/vhdx.c               |  16 +--
 block/vmdk.c               |   9 +-
 block/vpc.c                |   8 +-
 blockdev.c                 |   2 +-
 docs/interop/qcow2.txt     |  15 +-
 include/block/block.h      |  38 ++++-
 include/block/block_int.h  |  14 +-
 qapi/block-core.json       |   4 +
 qemu-img.c                 |   9 +-
 tests/qemu-iotests/031.out |  14 +-
 tests/qemu-iotests/036     |   6 +-
 tests/qemu-iotests/036.out |  10 +-
 tests/qemu-iotests/060.out |   6 +-
 tests/qemu-iotests/061     |   6 +-
 tests/qemu-iotests/061.out |  26 ++--
 tests/qemu-iotests/065     |  12 +-
 tests/qemu-iotests/082.out |   7 +
 tests/qemu-iotests/122     |   2 +-
 tests/qemu-iotests/188     |   2 +-
 tests/qemu-iotests/188.out |   2 +-
 tests/qemu-iotests/206.out |   4 +
 tests/qemu-iotests/242.out |   1 +
 tests/qemu-iotests/285     | 124 +++++++++++++++++
 tests/qemu-iotests/285.out | 277 +++++++++++++++++++++++++++++++++++++
 tests/qemu-iotests/group   |   1 +
 42 files changed, 832 insertions(+), 178 deletions(-)
 create mode 100755 tests/qemu-iotests/285
 create mode 100644 tests/qemu-iotests/285.out

Comments

Max Reitz Feb. 4, 2020, 5:32 p.m. UTC | #1
On 31.01.20 18:44, Eric Blake wrote:
> Based-on: <20200124103458.1525982-2-david.edmondson@oracle.com>
> ([PATCH v2 1/2] qemu-img: Add --target-is-zero to convert)
> 
> I'm working on adding an NBD extension that reports whether an image
> is already all zero when the client first connects.  I initially
> thought I could write the NBD code to just call bdrv_has_zero_init(),
> but that turned out to be a bad assumption that instead resulted in
> this patch series.  The NBD patch will come later (and cross-posted to
> the NBD protocol, libnbd, nbdkit, and qemu, as it will affect all four
> repositories).

We had a discussion about this on IRC, and as far as I remember I wasn’t
quite sold on the “why”.  So, again, I wonder why this is needed.

I mean, it does make intuitive sense to want to know whether an image is
fully zero, but if I continue thinking about it I don’t know any case
where we would need to figure it out and where we could accept “We don’t
know” as an answer.  So I’m looking for use cases, but this cover letter
doesn’t mention any.  (And from a quick glance I don’t see this series
using the flag, actually.)

(We have a use case with convert -n to freshly created image files, but
my position on this on IRC was that we want the --target-is-zero flag
for that anyway: Auto-detection may always break, our preferred default
behavior may always change, so if you want convert -n not to touch the
target image except to write non-zero data from the source, we need a
--target-is-zero flag and users need to use it.  Well, management
layers, because I don’t think users would use convert -n anyway.

And with --target-is-zero and users effectively having to use it, I
don’t think that’s a good example of a use case.)

I suppose there is the point of blockdev-create + blockdev-mirror: This
has exactly the same problem as convert -n.  But again, if you really
want blockdev-mirror not just to force-zero the image, you probably need
to tell it so explicitly (i.e., with a --target-is-zero flag for
blockdev-mirror).

(Well, I suppose we could save us a target-is-zero for mirror if we took
this series and had a filter driver that force-reports BDRV_ZERO_OPEN.
But, well, please no.)

But maybe I’m just an idiot and there is no reason not to take this
series and make blockdev-create + blockdev-mirror do the sensible thing
by default in most cases. *shrug*

Max
Eric Blake Feb. 4, 2020, 6:53 p.m. UTC | #2
On 2/4/20 11:32 AM, Max Reitz wrote:
> On 31.01.20 18:44, Eric Blake wrote:
>> Based-on: <20200124103458.1525982-2-david.edmondson@oracle.com>
>> ([PATCH v2 1/2] qemu-img: Add --target-is-zero to convert)
>>
>> I'm working on adding an NBD extension that reports whether an image
>> is already all zero when the client first connects.  I initially
>> thought I could write the NBD code to just call bdrv_has_zero_init(),
>> but that turned out to be a bad assumption that instead resulted in
>> this patch series.  The NBD patch will come later (and cross-posted to
>> the NBD protocol, libnbd, nbdkit, and qemu, as it will affect all four
>> repositories).
> 
> We had a discussion about this on IRC, and as far as I remember I wasn’t
> quite sold on the “why”.  So, again, I wonder why this is needed.
> 
> I mean, it does make intuitive sense to want to know whether an image is
> fully zero, but if I continue thinking about it I don’t know any case
> where we would need to figure it out and where we could accept “We don’t
> know” as an answer.  So I’m looking for use cases, but this cover letter
> doesn’t mention any.  (And from a quick glance I don’t see this series
> using the flag, actually.)

Patch 10/17 has:

diff --git a/qemu-img.c b/qemu-img.c
index e60217e6c382..c8519a74f738 100644
--- a/qemu-img.c
+++ b/qemu-img.c
@@ -1985,10 +1985,12 @@ static int convert_do_copy(ImgConvertState *s)
      int64_t sector_num = 0;

      /* Check whether we have zero initialisation or can get it efficiently */
-    if (!s->has_zero_init && s->target_is_new && s->min_sparse &&
-        !s->target_has_backing) {
-        s->has_zero_init = !!(bdrv_known_zeroes(blk_bs(s->target)) &
-                              BDRV_ZERO_CREATE);
+    if (!s->has_zero_init && s->min_sparse && !s->target_has_backing) {
+        ret = bdrv_known_zeroes(blk_bs(s->target));
+        if (ret & BDRV_ZERO_OPEN ||
+            (s->target_is_new && ret & BDRV_ZERO_CREATE)) {
+            s->has_zero_init = true;
+        }
      }

That's the use case: when copying into a destination file, it's useful 
to know if the destination already reads as all zeroes, before 
attempting a fallback to bdrv_make_zero(BDRV_REQ_NO_FALLBACK) or calls 
to block status checking for holes.

> 
> (We have a use case with convert -n to freshly created image files, but
> my position on this on IRC was that we want the --target-is-zero flag
> for that anyway: Auto-detection may always break, our preferred default
> behavior may always change, so if you want convert -n not to touch the
> target image except to write non-zero data from the source, we need a
> --target-is-zero flag and users need to use it.  Well, management
> layers, because I don’t think users would use convert -n anyway.
> 
> And with --target-is-zero and users effectively having to use it, I
> don’t think that’s a good example of a use case.)

Yes, there will still be cases where you have to use --target-is-zero 
because the image itself couldn't report that it already reads as 
zeroes, but there are also enough cases where the destination is already 
known to read zeroes and it's a shame to tell the user that 'you have to 
add --target-is-zero to get faster copying even though we could have 
inferred it on your behalf'.

> 
> I suppose there is the point of blockdev-create + blockdev-mirror: This
> has exactly the same problem as convert -n.  But again, if you really
> want blockdev-mirror not just to force-zero the image, you probably need
> to tell it so explicitly (i.e., with a --target-is-zero flag for
> blockdev-mirror).
> 
> (Well, I suppose we could save us a target-is-zero for mirror if we took
> this series and had a filter driver that force-reports BDRV_ZERO_OPEN.
> But, well, please no.)
> 
> But maybe I’m just an idiot and there is no reason not to take this
> series and make blockdev-create + blockdev-mirror do the sensible thing
> by default in most cases. *shrug*

My argument for taking the series _is_ that the common case can be made 
more efficient without user effort.  Yes, we still need the knob for 
when the common case isn't already smart enough, but the difference in 
avoiding a pre-zeroing pass is noticeable when copying images around 
(and more than just for qcow2 - my followup series to improve NBD is 
similarly useful given how much work has already been invested in 
mapping NBD into storage access over https in the upper layers like ovirt).
Vladimir Sementsov-Ogievskiy Feb. 5, 2020, 9:04 a.m. UTC | #3
31.01.2020 20:44, Eric Blake wrote:
> Based-on: <20200124103458.1525982-2-david.edmondson@oracle.com>
> ([PATCH v2 1/2] qemu-img: Add --target-is-zero to convert)
> 
> I'm working on adding an NBD extension that reports whether an image
> is already all zero when the client first connects.  I initially
> thought I could write the NBD code to just call bdrv_has_zero_init(),
> but that turned out to be a bad assumption that instead resulted in
> this patch series.  The NBD patch will come later (and cross-posted to
> the NBD protocol, libnbd, nbdkit, and qemu, as it will affect all four
> repositories).
> 
> I do have an RFC question on patch 13 - as implemented here, I set a
> qcow2 bit if the image has all clusters known zero and no backing
> image.  But it may be more useful to instead report whether all
> clusters _allocated in this layer_ are zero, at which point the
> overall image is all-zero only if the backing file also has that
> property (or even make it two bits).  The tweaks to subsequent patches
> based on what we think makes the most useful semantics shouldn't be
> hard.
> 
> [repo.or.cz appears to be down as I type this; I'll post a link to a
> repository later when it comes back up]
> 

I have several ideas around it.

1. For generic block layer.
Did you consider, as an alternative to BDRV_ZERO_OPEN, exporting the
information through normal block_status? If we have the information
that the disk is all-zero, we can always add the _ZERO flag to the
block-status result. And in a generic bdrv_is_all_zeroes(), we can
just call block_status(0, disk_size), which will return ZERO and
n=disk_size if the driver supports the all-zero feature and the
image is all-zero now.
I think block-status is a natural way to convey such information,
and I think we want to get to 64-bit block-status support for qcow2
and NBD anyway.

2. For NBD
Again, a possible alternative is BLOCK_STATUS, but we need 64-bit
commands for it. I plan to send a proposal anyway. Still, there is
nothing bad about having two possible paths for receiving the
all-zero information. And even with your NBD extension, we can
export this information through block-status [1.]

3. For qcow2
Hmm. Here, as I understand it, the main case is a freshly created
qcow2, which is fully unallocated. To see that it is empty, we only
need to check all L1 entries. And for an empty L1 table that is fast.
So we don't need any qcow2 format improvement to check it.
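
A rough sketch of that L1 check (not code from this series; it assumes
an already-open image and uses names from block/qcow2.h):

static bool qcow2_l1_fully_unallocated(BDRVQcow2State *s)
{
    int i;

    for (i = 0; i < s->l1_size; i++) {
        /* a nonzero offset means an allocated L2 table, so possible data */
        if (s->l1_table[i] & L1E_OFFSET_MASK) {
            return false;
        }
    }
    return true; /* no L2 tables: without a backing file, reads as zeroes */
}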
Vladimir Sementsov-Ogievskiy Feb. 5, 2020, 9:25 a.m. UTC | #4
05.02.2020 12:04, Vladimir Sementsov-Ogievskiy wrote:
> 31.01.2020 20:44, Eric Blake wrote:
>> Based-on: <20200124103458.1525982-2-david.edmondson@oracle.com>
>> ([PATCH v2 1/2] qemu-img: Add --target-is-zero to convert)
>>
>> I'm working on adding an NBD extension that reports whether an image
>> is already all zero when the client first connects.  I initially
>> thought I could write the NBD code to just call bdrv_has_zero_init(),
>> but that turned out to be a bad assumption that instead resulted in
>> this patch series.  The NBD patch will come later (and cross-posted to
>> the NBD protocol, libnbd, nbdkit, and qemu, as it will affect all four
>> repositories).
>>
>> I do have an RFC question on patch 13 - as implemented here, I set a
>> qcow2 bit if the image has all clusters known zero and no backing
>> image.  But it may be more useful to instead report whether all
>> clusters _allocated in this layer_ are zero, at which point the
>> overall image is all-zero only if the backing file also has that
>> property (or even make it two bits).  The tweaks to subsequent patches
>> based on what we think makes the most useful semantics shouldn't be
>> hard.
>>
>> [repo.or.cz appears to be down as I type this; I'll post a link to a
>> repository later when it comes back up]
>>
> 
> I have several ideas around it.
> 
> 1. For generic block layer.
> Did you consider as alternative to BDRV_ZERO_OPEN, to export the
> information through normal block_status? So, if we have the
> information, that disk is all-zero, we can always add _ZERO
> flag to block-status result. And in generic bdrv_is_all_zeroes(),
> we can just call block_status(0, disk_size), which will return
> ZERO and n=disk_size if driver supports all-zero feature and is
> all-zero now.
> I think block-status is a native way for such information, and I
> think that we anyway want to come to support of 64bit block-status
> for qcow2 and nbd.
> 
> 2. For NBD
> Again, possible alternative is BLOCK_STATUS, but we need 64bit
> commands for it. I plan to send a proposal anyway. Still, nothing
> bad in two possible path of receiving all-zero information.
> And even with your NBD extension, we can export this information
> through block-status [1.]
> 
> 3. For qcow2
> Hmm. Here, as I understand, than main case is freshly created qcow2,
> which is fully-unallocated. To understand that it is empty, we
> need only to check all L1 entries. And for empty L1 table it is fast.
> So we don't need any qcow2 format improvement to check it.
> 

Ah yes, I forgot about the preallocated case. Hmm. For preallocated
clusters, we have zero bits in the L2 entries. And with them, we don't
even need the preallocated clusters to be filled with zeroes, as we
never read them (we just return zeroes on read)..

Then, maybe we want a similar flag for an L1 entry (this would enable
large, fast write-zero requests). And maybe we want a flag which marks
the whole image as reading zero (that's your flag). So, now I think my
previous idea of "all allocated is zero" is worse: for fully-preallocated
images we are sure that all clusters are allocated, and it is more
natural to have flags similar to the ZERO bit in an L2 entry.
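
(For reference, the existing per-cluster flag being referred to, as
defined in block/qcow2.h; an L2 entry with this bit set reads as zeroes
even if a cluster is preallocated behind it:)

#define QCOW_OFLAG_ZERO (1ULL << 0) /* cluster reads as all zeros */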
Eric Blake Feb. 5, 2020, 2:22 p.m. UTC | #5
On 2/5/20 3:04 AM, Vladimir Sementsov-Ogievskiy wrote:

>> [repo.or.cz appears to be down as I type this; I'll post a link to a
>> repository later when it comes back up]

Now up
https://repo.or.cz/qemu/ericb.git/shortlog/refs/tags/qcow2-all-zero-v1

>>
> 
> I have several ideas around it.
> 
> 1. For generic block layer.
> Did you consider as alternative to BDRV_ZERO_OPEN, to export the
> information through normal block_status? So, if we have the
> information, that disk is all-zero, we can always add _ZERO
> flag to block-status result.

Makes sense.

> And in generic bdrv_is_all_zeroes(),
> we can just call block_status(0, disk_size), which will return
> ZERO and n=disk_size if driver supports all-zero feature and is
> all-zero now.

Less obvious.  block_status is not required to visit the entire disk, 
even if the entire disk is all zero.  For example, qcow2 visits at most 
one L2 page in a call (if the request spans L1 entries, it will be 
truncated at the boundary, even if the region before and after the 
boundary have the same status).  I'm also worried if we still have 
32-bit limitations in block_status (ideally, we've fixed things to 
support 64-bit status where possible, but I'm not counting on it).

> I think block-status is a native way for such information, and I
> think that we anyway want to come to support of 64bit block-status
> for qcow2 and nbd.

Block status requires an O(n) loop over the disk, where n is the number 
of distinct extents possible.  If you get lucky, and 
block_status(0,size) returns a single extent, then yes that can feed the 
'is_zeroes' request.  Similarly, a single return of non-zero data can 
instantly tell you that 'is_zeroes' is false.  But given that drivers 
may break up their response on convenient boundaries, such as qcow2 on 
L1 entry granularity, you cannot blindly assume that a return of zero 
data for smaller than the requested size implies non-zero data, only 
that there is insufficient information to tell if the disk is all_zeroes 
without querying further block_status calls, and that's where you lose 
out on the speed compared to just being told up-front from an 'is_zero' 
call.
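
To illustrate (a hypothetical helper, not code from this series): a
caller relying on block_status alone has to loop until the whole disk
has been covered, because any single call may legitimately stop early:

static bool is_all_zeroes_via_block_status(BlockDriverState *bs)
{
    int64_t size = bdrv_getlength(bs);
    int64_t offset = 0;

    if (size < 0) {
        return false;
    }
    while (offset < size) {
        int64_t pnum;
        int ret = bdrv_block_status(bs, offset, size - offset,
                                    &pnum, NULL, NULL);
        if (ret < 0 || !(ret & BDRV_BLOCK_ZERO)) {
            return false; /* error, or an extent that may hold data */
        }
        offset += pnum;   /* drivers may stop at, e.g., an L1 boundary */
    }
    return true;
}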

> 
> 2. For NBD
> Again, possible alternative is BLOCK_STATUS, but we need 64bit
> commands for it. I plan to send a proposal anyway. Still, nothing
> bad in two possible path of receiving all-zero information.
> And even with your NBD extension, we can export this information
> through block-status [1.]

Yes, having 64-bit BLOCK_STATUS in NBD is orthogonal to this, but both 
ideas are independently useful, and as the level of difficulty in 
implementing things may vary, it is conceivable to have both a server 
that provides 'is_zero' but not BLOCK_STATUS, and a server that provides 
64-bit BLOCK_STATUS but not 'is_zero'.

> 
> 3. For qcow2
> Hmm. Here, as I understand, than main case is freshly created qcow2,
> which is fully-unallocated. To understand that it is empty, we
> need only to check all L1 entries. And for empty L1 table it is fast.
> So we don't need any qcow2 format improvement to check it.

The benefit of this patch series is that it detects preallocated qcow2 
images as all_zero.  What's more, scanning all L1 entries is O(n), but 
detecting an autoclear all_zero bit is O(1).  Your proposed L1 scan is 
accurate for fewer cases, and costs more time.
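
For comparison, the O(1) check is roughly (the flag name here is
hypothetical, and BDRV_ZERO_OPEN comes from patch 10 of this series;
s->autoclear_features is already loaded by qcow2_open()):

    if (s->autoclear_features & QCOW2_AUTOCLEAR_ALL_ZERO) {
        ret |= BDRV_ZERO_OPEN; /* image currently reads as all zeroes */
    }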
Eric Blake Feb. 5, 2020, 2:26 p.m. UTC | #6
On 2/5/20 3:25 AM, Vladimir Sementsov-Ogievskiy wrote:

>> 3. For qcow2
>> Hmm. Here, as I understand, than main case is freshly created qcow2,
>> which is fully-unallocated. To understand that it is empty, we
>> need only to check all L1 entries. And for empty L1 table it is fast.
>> So we don't need any qcow2 format improvement to check it.
>>
> 
> Ah yes, I forget about preallocated case. Hmm. For preallocated clusters,
> we have zero bits in L2 entries. And with them, we even don't need
> preallocated to be filled by zeros, as we never read them (but just return
> zeros on read)..

Scanning all L2 entries is O(n), while an autoclear bit properly 
maintained is O(1).

> 
> Then, may be we want similar flag for L1 entry (this will enable large
> fast write-zero). And may be we want flag which marks the whole image
> as read-zero (it's your flag). So, now I think, my previous idea
> of "all allocated is zero" is worse. As for fully-preallocated images
> we are sure that all clusters are allocated, and it is more native to
> have flags similar to ZERO bit in L2 entry.

Right now, we don't have any L1 entry flags.  Adding one would require 
adding an incompatible feature flag (if older qemu would choke to see 
unexpected flags in an L1 entry), or at best an autoclear feature flag 
(if the autoclear bit gets cleared because an older qemu opened the 
image and couldn't maintain L1 entry flags correctly, then newer qemu 
knows it cannot trust those L1 entry flags).  But as soon as you are 
talking about adding a feature bit, then why add one that still requires 
O(n) traversal to check (true, the 'n' in an O(n) traversal of L1 tables 
is much smaller than the 'n' in an O(n) traversal of L2 tables), when 
you can instead just add an O(1) autoclear bit that maintains all_zero 
status for the image as a whole?
Vladimir Sementsov-Ogievskiy Feb. 5, 2020, 2:43 p.m. UTC | #7
05.02.2020 17:22, Eric Blake wrote:
> On 2/5/20 3:04 AM, Vladimir Sementsov-Ogievskiy wrote:
> 
>>> [repo.or.cz appears to be down as I type this; I'll post a link to a
>>> repository later when it comes back up]
> 
> Now up
> https://repo.or.cz/qemu/ericb.git/shortlog/refs/tags/qcow2-all-zero-v1
> 
>>>
>>
>> I have several ideas around it.
>>
>> 1. For generic block layer.
>> Did you consider as alternative to BDRV_ZERO_OPEN, to export the
>> information through normal block_status? So, if we have the
>> information, that disk is all-zero, we can always add _ZERO
>> flag to block-status result.
> 
> Makes sense.
> 
>> And in generic bdrv_is_all_zeroes(),
>> we can just call block_status(0, disk_size), which will return
>> ZERO and n=disk_size if driver supports all-zero feature and is
>> all-zero now.
> 
> Less obvious.  block_status is not required to visit the entire disk, even if the entire disk is all zero.  For example, qcow2 visits at most one L2 page in a call (if the request spans L1 entries, it will be truncated at the boundary, even if the region before and after the boundary have the same status).  I'm also worried if we still have 32-bit limitations in block_status (ideally, we've fixed things to support 64-bit status where possible, but I'm not counting on it).

Not required, but why not do it? If we have the information that the whole disk has the same ZERO status, there is no reason to reply to block_status(0, disk_size) with a smaller n.

> 
>> I think block-status is a native way for such information, and I
>> think that we anyway want to come to support of 64bit block-status
>> for qcow2 and nbd.
> 
> Block status requires an O(n) loop over the disk, where n is the number of distinct extents possible.  If you get lucky, and block_status(0,size) returns a single extent, then yes that can feed the 'is_zeroes' request.  Similarly, a single return of non-zero data can instantly tell you that 'is_zeroes' is false.  But given that drivers may break up their response on convenient boundaries, such as qcow2 on L1 entry granularity, you cannot blindly assume that a return of zero data for smaller than the requested size implies non-zero data, only that there is insufficient information to tell if the disk is all_zeroes without querying further block_status calls, and that's where you lose out on the speed compared to just being told up-front from an 'is_zero' call.

Yes. But how is it worse than BDRV_ZERO_OPEN? With one block_status call we have the same information. If on block_status(0, disk_size) the driver replies with ZERO but a size smaller than disk_size, it means that either the disk is not all-zero, or the driver doesn't support a 'fast whole-disk zero check' feature, which is equivalent to not supporting BDRV_ZERO_OPEN.

> 
>>
>> 2. For NBD
>> Again, possible alternative is BLOCK_STATUS, but we need 64bit
>> commands for it. I plan to send a proposal anyway. Still, nothing
>> bad in two possible path of receiving all-zero information.
>> And even with your NBD extension, we can export this information
>> through block-status [1.]
> 
> Yes, having 64-bit BLOCK_STATUS in NBD is orthogonal to this, but both ideas are independently useful, and as the level of difficulty in implementing things may vary, it is conceivable to have both a server that provides 'is_zero' but not BLOCK_STATUS, and a server that provides 64-bit BLOCK_STATUS but not 'is_zero'.
> 
>>
>> 3. For qcow2
>> Hmm. Here, as I understand, than main case is freshly created qcow2,
>> which is fully-unallocated. To understand that it is empty, we
>> need only to check all L1 entries. And for empty L1 table it is fast.
>> So we don't need any qcow2 format improvement to check it.
> 
> The benefit of this patch series is that it detects preallocated qcow2 images as all_zero.  What's more, scanning all L1 entries is O(n), but detecting an autoclear all_zero bit is O(1).  Your proposed L1 scan is accurate for fewer cases, and costs more time.

Ah yes, somehow I thought that the L1 table is not allocated for a fresh image..

Hmm, then possibly we need two new top-level flags: "all-zero" and "all-unallocated"..
Vladimir Sementsov-Ogievskiy Feb. 5, 2020, 2:47 p.m. UTC | #8
05.02.2020 17:26, Eric Blake wrote:
> On 2/5/20 3:25 AM, Vladimir Sementsov-Ogievskiy wrote:
> 
>>> 3. For qcow2
>>> Hmm. Here, as I understand, than main case is freshly created qcow2,
>>> which is fully-unallocated. To understand that it is empty, we
>>> need only to check all L1 entries. And for empty L1 table it is fast.
>>> So we don't need any qcow2 format improvement to check it.
>>>
>>
>> Ah yes, I forget about preallocated case. Hmm. For preallocated clusters,
>> we have zero bits in L2 entries. And with them, we even don't need
>> preallocated to be filled by zeros, as we never read them (but just return
>> zeros on read)..
> 
> Scanning all L2 entries is O(n), while an autoclear bit properly maintained is O(1).
> 
>>
>> Then, may be we want similar flag for L1 entry (this will enable large
>> fast write-zero). And may be we want flag which marks the whole image
>> as read-zero (it's your flag). So, now I think, my previous idea
>> of "all allocated is zero" is worse. As for fully-preallocated images
>> we are sure that all clusters are allocated, and it is more native to
>> have flags similar to ZERO bit in L2 entry.
> 
> Right now, we don't have any L1 entry flags.  Adding one would require adding an incompatible feature flag (if older qemu would choke to see unexpected flags in an L1 entry), or at best an autoclear feature flag (if the autoclear bit gets cleared because an older qemu opened the image and couldn't maintain L1 entry flags correctly, then newer qemu knows it cannot trust those L1 entry flags).  But as soon as you are talking about adding a feature bit, then why add one that still requires O(n) traversal to check (true, the 'n' in an O(n) traversal of L1 tables is much smaller than the 'n' in an O(n) traversal of L2 tables), when you can instead just add an O(1) autoclear bit that maintains all_zero status for the image as a whole?
> 

My suggestion about an L1 entry flag is a side thing; I understand the difference between O(n) and O(1) :) Still, an additional L1 entry flag would help to make large block-status and write-zero requests efficient.

And I agree that we need a top-level flag.. I just try to say that it seems good to make it similar to the existing L2 flag. But yes, it would be an incompatible change, as it marks all clusters as ZERO, and older QEMU can't understand it and may treat all clusters as unallocated.
Vladimir Sementsov-Ogievskiy Feb. 5, 2020, 2:58 p.m. UTC | #9
05.02.2020 17:43, Vladimir Sementsov-Ogievskiy wrote:
> 05.02.2020 17:22, Eric Blake wrote:
>> On 2/5/20 3:04 AM, Vladimir Sementsov-Ogievskiy wrote:
>>
>>>> [repo.or.cz appears to be down as I type this; I'll post a link to a
>>>> repository later when it comes back up]
>>
>> Now up
>> https://repo.or.cz/qemu/ericb.git/shortlog/refs/tags/qcow2-all-zero-v1
>>
>>>>
>>>
>>> I have several ideas around it.
>>>
>>> 1. For generic block layer.
>>> Did you consider as alternative to BDRV_ZERO_OPEN, to export the
>>> information through normal block_status? So, if we have the
>>> information, that disk is all-zero, we can always add _ZERO
>>> flag to block-status result.
>>
>> Makes sense.
>>
>>> And in generic bdrv_is_all_zeroes(),
>>> we can just call block_status(0, disk_size), which will return
>>> ZERO and n=disk_size if driver supports all-zero feature and is
>>> all-zero now.
>>
>> Less obvious.  block_status is not required to visit the entire disk, even if the entire disk is all zero.  For example, qcow2 visits at most one L2 page in a call (if the request spans L1 entries, it will be truncated at the boundary, even if the region before and after the boundary have the same status).  I'm also worried if we still have 32-bit limitations in block_status (ideally, we've fixed things to support 64-bit status where possible, but I'm not counting on it).
> 
> Not required, but why not doing it? If we have information that all disk is of the same ZERO status, no reasons to not reply on block_status(0, disk_size) with smaller n.
> 
>>
>>> I think block-status is a native way for such information, and I
>>> think that we anyway want to come to support of 64bit block-status
>>> for qcow2 and nbd.
>>
>> Block status requires an O(n) loop over the disk, where n is the number of distinct extents possible.  If you get lucky, and block_status(0,size) returns a single extent, then yes that can feed the 'is_zeroes' request.  Similarly, a single return of non-zero data can instantly tell you that 'is_zeroes' is false.  But given that drivers may break up their response on convenient boundaries, such as qcow2 on L1 entry granularity, you cannot blindly assume that a return of zero data for smaller than the requested size implies non-zero data, only that there is insufficient information to tell if the disk is all_zeroes without querying further block_status calls, and that's where you lose out on the speed compared to just being told up-front from an 'is_zero' call.
> 
> Yes. But how is it worse than BDRV_ZERO_OPEN? With one block_status call we have the same information. If on block_status(0, disk_size) driver replies with ZERO but smaller than disk_size, it means that either disk is not all-zero, or driver doesn't support 'fast whole-disk zero check' feature, which is equal to not supporting BDRV_ZERO_OPEN.
> 
>>
>>>
>>> 2. For NBD
>>> Again, possible alternative is BLOCK_STATUS, but we need 64bit
>>> commands for it. I plan to send a proposal anyway. Still, nothing
>>> bad in two possible path of receiving all-zero information.
>>> And even with your NBD extension, we can export this information
>>> through block-status [1.]
>>
>> Yes, having 64-bit BLOCK_STATUS in NBD is orthogonal to this, but both ideas are independently useful, and as the level of difficulty in implementing things may vary, it is conceivable to have both a server that provides 'is_zero' but not BLOCK_STATUS, and a server that provides 64-bit BLOCK_STATUS but not 'is_zero'.
>>
>>>
>>> 3. For qcow2
>>> Hmm. Here, as I understand, than main case is freshly created qcow2,
>>> which is fully-unallocated. To understand that it is empty, we
>>> need only to check all L1 entries. And for empty L1 table it is fast.
>>> So we don't need any qcow2 format improvement to check it.
>>
>> The benefit of this patch series is that it detects preallocated qcow2 images as all_zero.  What's more, scanning all L1 entries is O(n), but detecting an autoclear all_zero bit is O(1).  Your proposed L1 scan is accurate for fewer cases, and costs more time.
> 
> Ah yes, somehow I thought that L1 is not allocated for fresh image..
> 
> Hmm, than possibly we need two new top-level flags: "all-zero" and "all-unallocated"..
> 

It makes sense only with incompatible semantics. With autoclear semantics it's better to have a single 'all-allocated-are-zero' flag and not care.
Vladimir Sementsov-Ogievskiy Feb. 5, 2020, 3:14 p.m. UTC | #10
05.02.2020 17:47, Vladimir Sementsov-Ogievskiy wrote:
> 05.02.2020 17:26, Eric Blake wrote:
>> On 2/5/20 3:25 AM, Vladimir Sementsov-Ogievskiy wrote:
>>
>>>> 3. For qcow2
>>>> Hmm. Here, as I understand, than main case is freshly created qcow2,
>>>> which is fully-unallocated. To understand that it is empty, we
>>>> need only to check all L1 entries. And for empty L1 table it is fast.
>>>> So we don't need any qcow2 format improvement to check it.
>>>>
>>>
>>> Ah yes, I forget about preallocated case. Hmm. For preallocated clusters,
>>> we have zero bits in L2 entries. And with them, we even don't need
>>> preallocated to be filled by zeros, as we never read them (but just return
>>> zeros on read)..
>>
>> Scanning all L2 entries is O(n), while an autoclear bit properly maintained is O(1).
>>
>>>
>>> Then, may be we want similar flag for L1 entry (this will enable large
>>> fast write-zero). And may be we want flag which marks the whole image
>>> as read-zero (it's your flag). So, now I think, my previous idea
>>> of "all allocated is zero" is worse. As for fully-preallocated images
>>> we are sure that all clusters are allocated, and it is more native to
>>> have flags similar to ZERO bit in L2 entry.
>>
>> Right now, we don't have any L1 entry flags.  Adding one would require adding an incompatible feature flag (if older qemu would choke to see unexpected flags in an L1 entry), or at best an autoclear feature flag (if the autoclear bit gets cleared because an older qemu opened the image and couldn't maintain L1 entry flags correctly, then newer qemu knows it cannot trust those L1 entry flags).  But as soon as you are talking about adding a feature bit, then why add one that still requires O(n) traversal to check (true, the 'n' in an O(n) traversal of L1 tables is much smaller than the 'n' in an O(n) traversal of L2 tables), when you can instead just add an O(1) autoclear bit that maintains all_zero status for the image as a whole?
>>
> 
> My suggestion about L1 entry flag is side thing, I understand difference between O(n) and O(1) :) Still additional L1 entry will help to make efficient large block-status and write-zero requests.
> 
> And I agree that we need top level flag.. I just try to say, that it seems good to make it similar with existing L2 flag. But yes, it would be incomaptible change, as it marks all clusters as ZERO, and older Qemu can't understand it and may treat all clusters as unallocated.
> 

Still, how long does this O(n) scan take? We load the whole L1 table into memory anyway. For example, for a 16 TiB disk with 64 KiB cluster granularity, we'll have 32768 L1 entries. Will we get a noticeable performance benefit from an extension? I doubt it now. And anyway, if we have an extension, we should fall back to this O(n) scan whenever the flag is not set.
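
(The arithmetic behind the 32768 figure, assuming the default 64 KiB
clusters and 8-byte L2 entries:
  one L2 table maps (65536 / 8) * 65536 bytes = 512 MiB of guest data
  16 TiB / 512 MiB = 32768 L1 entries)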

So, I think the flag is beneficial only for preallocated images.

Hmm, and for such images, if we want, we can define this flag as 'all clusters are allocated zeroes', which would prove that the image reads as zero independently of any backing relations.
Max Reitz Feb. 5, 2020, 5:04 p.m. UTC | #11
On 04.02.20 19:53, Eric Blake wrote:
> On 2/4/20 11:32 AM, Max Reitz wrote:
>> On 31.01.20 18:44, Eric Blake wrote:
>>> Based-on: <20200124103458.1525982-2-david.edmondson@oracle.com>
>>> ([PATCH v2 1/2] qemu-img: Add --target-is-zero to convert)
>>>
>>> I'm working on adding an NBD extension that reports whether an image
>>> is already all zero when the client first connects.  I initially
>>> thought I could write the NBD code to just call bdrv_has_zero_init(),
>>> but that turned out to be a bad assumption that instead resulted in
>>> this patch series.  The NBD patch will come later (and cross-posted to
>>> the NBD protocol, libnbd, nbdkit, and qemu, as it will affect all four
>>> repositories).
>>
>> We had a discussion about this on IRC, and as far as I remember I wasn’t
>> quite sold on the “why”.  So, again, I wonder why this is needed.
>>
>> I mean, it does make intuitive sense to want to know whether an image is
>> fully zero, but if I continue thinking about it I don’t know any case
>> where we would need to figure it out and where we could accept “We don’t
>> know” as an answer.  So I’m looking for use cases, but this cover letter
>> doesn’t mention any.  (And from a quick glance I don’t see this series
>> using the flag, actually.)
> 
> Patch 10/17 has:
> 
> diff --git a/qemu-img.c b/qemu-img.c
> index e60217e6c382..c8519a74f738 100644
> --- a/qemu-img.c
> +++ b/qemu-img.c
> @@ -1985,10 +1985,12 @@ static int convert_do_copy(ImgConvertState *s)
>      int64_t sector_num = 0;
> 
>      /* Check whether we have zero initialisation or can get it efficiently */
> -    if (!s->has_zero_init && s->target_is_new && s->min_sparse &&
> -        !s->target_has_backing) {
> -        s->has_zero_init = !!(bdrv_known_zeroes(blk_bs(s->target)) &
> -                              BDRV_ZERO_CREATE);
> +    if (!s->has_zero_init && s->min_sparse && !s->target_has_backing) {
> +        ret = bdrv_known_zeroes(blk_bs(s->target));
> +        if (ret & BDRV_ZERO_OPEN ||
> +            (s->target_is_new && ret & BDRV_ZERO_CREATE)) {
> +            s->has_zero_init = true;
> +        }
>      }

OK, I expected users to come in a separate patch.

> That's the use case: when copying into a destination file, it's useful
> to know if the destination already reads as all zeroes, before
> attempting a fallback to bdrv_make_zero(BDRV_REQ_NO_FALLBACK) or calls
> to block status checking for holes.

But that was my point on IRC.  Is it really more useful if
bdrv_make_zero() is just as quick?  (And the fact that NBD doesn’t have
an implementation looks more like a problem with NBD to me.)

(Considering that at least the code we discussed on IRC didn’t work for
preallocated images, which was the one point where we actually have a
problem in practice.)

>> (We have a use case with convert -n to freshly created image files, but
>> my position on this on IRC was that we want the --target-is-zero flag
>> for that anyway: Auto-detection may always break, our preferred default
>> behavior may always change, so if you want convert -n not to touch the
>> target image except to write non-zero data from the source, we need a
>> --target-is-zero flag and users need to use it.  Well, management
>> layers, because I don’t think users would use convert -n anyway.
>>
>> And with --target-is-zero and users effectively having to use it, I
>> don’t think that’s a good example of a use case.)
> 
> Yes, there will still be cases where you have to use --target-is-zero
> because the image itself couldn't report that it already reads as
> zeroes, but there are also enough cases where the destination is already
> known to read zeroes and it's a shame to tell the user that 'you have to
> add --target-is-zero to get faster copying even though we could have
> inferred it on your behalf'.

How is it a shame?  I think only management tools would use convert -n.
 Management tools want reliable behavior.  If you want reliable
behavior, you have to use --target-is-zero anyway.  So I don’t see the
actual benefit for qemu-img convert.

>> I suppose there is the point of blockdev-create + blockdev-mirror: This
>> has exactly the same problem as convert -n.  But again, if you really
>> want blockdev-mirror not just to force-zero the image, you probably need
>> to tell it so explicitly (i.e., with a --target-is-zero flag for
>> blockdev-mirror).
>>
>> (Well, I suppose we could save us a target-is-zero for mirror if we took
>> this series and had a filter driver that force-reports BDRV_ZERO_OPEN.
>> But, well, please no.)
>>
>> But maybe I’m just an idiot and there is no reason not to take this
>> series and make blockdev-create + blockdev-mirror do the sensible thing
>> by default in most cases. *shrug*
> 
> My argument for taking the series _is_ that the common case can be made
> more efficient without user effort.

The thing is, I don’t see the user effort.  I don’t think users use
convert -n or backup manually.  And for management tools, it isn’t
really effort to add another switch.

> Yes, we still need the knob for
> when the common case isn't already smart enough,

But the user can’t know when qemu isn’t smart enough.  So users who care
have to always give the flag.

> but the difference in
> avoiding a pre-zeroing pass is noticeable when copying images around

I’m sure it is, but the question I ask is whether in practice we
wouldn’t get --target-is-zero in all of these cases anyway.


So I’m not sold on “it works most of the time”, because if it’s just
most of the time, then we’ll likely see --target-is-zero all of the time.

OTOH, I suppose that with the new qcow2 extension, it would always work
for the following case:
(1) Create a qcow2 file,
(2) Immediately (with the next qemu-img/QMP invocation) use it as a
target of convert -n or mirror or anything similar.

If so, that means it works reliably all of the time for a common case.
I guess that’d be enough for me.

Max

> (and more than just for qcow2 - my followup series to improve NBD is
> similarly useful given how much work has already been invested in
> mapping NBD into storage access over https in the upper layers like ovirt).
>
Max Reitz Feb. 5, 2020, 5:58 p.m. UTC | #12
On 05.02.20 16:14, Vladimir Sementsov-Ogievskiy wrote:
> 05.02.2020 17:47, Vladimir Sementsov-Ogievskiy wrote:
>> 05.02.2020 17:26, Eric Blake wrote:
>>> On 2/5/20 3:25 AM, Vladimir Sementsov-Ogievskiy wrote:
>>>
>>>>> 3. For qcow2
>>>>> Hmm. Here, as I understand, than main case is freshly created qcow2,
>>>>> which is fully-unallocated. To understand that it is empty, we
>>>>> need only to check all L1 entries. And for empty L1 table it is fast.
>>>>> So we don't need any qcow2 format improvement to check it.
>>>>>
>>>>
>>>> Ah yes, I forget about preallocated case. Hmm. For preallocated
>>>> clusters,
>>>> we have zero bits in L2 entries. And with them, we even don't need
>>>> preallocated to be filled by zeros, as we never read them (but just
>>>> return
>>>> zeros on read)..
>>>
>>> Scanning all L2 entries is O(n), while an autoclear bit properly
>>> maintained is O(1).
>>>
>>>>
>>>> Then, may be we want similar flag for L1 entry (this will enable large
>>>> fast write-zero). And may be we want flag which marks the whole image
>>>> as read-zero (it's your flag). So, now I think, my previous idea
>>>> of "all allocated is zero" is worse. As for fully-preallocated images
>>>> we are sure that all clusters are allocated, and it is more native to
>>>> have flags similar to ZERO bit in L2 entry.
>>>
>>> Right now, we don't have any L1 entry flags.  Adding one would
>>> require adding an incompatible feature flag (if older qemu would
>>> choke to see unexpected flags in an L1 entry), or at best an
>>> autoclear feature flag (if the autoclear bit gets cleared because an
>>> older qemu opened the image and couldn't maintain L1 entry flags
>>> correctly, then newer qemu knows it cannot trust those L1 entry
>>> flags).  But as soon as you are talking about adding a feature bit,
>>> then why add one that still requires O(n) traversal to check (true,
>>> the 'n' in an O(n) traversal of L1 tables is much smaller than the
>>> 'n' in an O(n) traversal of L2 tables), when you can instead just add
>>> an O(1) autoclear bit that maintains all_zero status for the image as
>>> a whole?
>>>
>>
>> My suggestion about L1 entry flag is side thing, I understand
>> difference between O(n) and O(1) :) Still additional L1 entry will
>> help to make efficient large block-status and write-zero requests.
>>
>> And I agree that we need top level flag.. I just try to say, that it
>> seems good to make it similar with existing L2 flag. But yes, it would
>> be incomaptible change, as it marks all clusters as ZERO, and older
>> Qemu can't understand it and may treat all clusters as unallocated.
>>
> 
> Still, how long is this O(n) ? We load the whole L1 into memory anyway.
> For example, 16Tb disk with 64K granularity, we'll have 32768 L1
> entries. Will we get sensible performance benefit with an extension? I
> doubt in it now. And anyway, if we have an extension, we should fallback
> to this O(n) if we don't have the flag set.

(Sorry, it’s late and I haven’t followed this particular conversation
too closely, but:)

Keep in mind that the default metadata overlap protection mode causes
all L1 entries to be scanned on each I/O write.  So it can’t be that bad.

Max
Eric Blake Feb. 5, 2020, 7:21 p.m. UTC | #13
On 2/5/20 11:04 AM, Max Reitz wrote:
> OK, I expected users to come in a separate patch.

I can refactor that better in v2.

> 
>> That's the use case: when copying into a destination file, it's useful
>> to know if the destination already reads as all zeroes, before
>> attempting a fallback to bdrv_make_zero(BDRV_REQ_NO_FALLBACK) or calls
>> to block status checking for holes.
> 
> But that was my point on IRC.  Is it really more useful if
> bdrv_make_zero() is just as quick?  (And the fact that NBD doesn’t have
> an implementation looks more like a problem with NBD to me.)

That is indeed a thought - why should qemu-img even TRY to call 
bdrv_has_zero_init; it should instead call bdrv_make_zero(), which 
should be as fast as possible (see my latest reply on 9/17 exploring 
that idea more).  Under the hood, we can then make bdrv_make_zero() use 
whatever tricks it needs, whether keeping the driver's 
.bdrv_has_zero_init/_truncate callbacks, adding another callback, making 
write_zeroes faster, or whatever, but instead of making qemu-img sort 
through multiple ideas, the burden would now be hidden in the block layer.

> 
> (Considering that at least the code we discussed on IRC didn’t work for
> preallocated images, which was the one point where we actually have a
> problem in practice.)

And this series DOES improve the case for preallocated qcow2 images, by 
virtue of a new qcow2 autoclear bit.  But again, that may be something 
we want to hide in the driver callback interfaces, while qemu-img just 
blindly calls bdrv_make_zero() and gets success (the image now reads as 
zeroes, either because it was already that way or we did something 
quick) or failure (it is a waste of time to prezero, whether by 
write_zeroes or by trim or by truncate, so just manually write zeroes as 
part of your image copying).
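
A sketch of what that qemu-img flow could then look like (assuming
blk_make_zero() learns to fail fast when pre-zeroing would not be
cheaper; this is not the code from the series):

    /* Let the block layer decide: either the target already reads as
     * zeroes, or it can cheaply be made to; otherwise fall back to
     * writing zeroes explicitly while copying data. */
    if (!s->has_zero_init && s->min_sparse && !s->target_has_backing) {
        if (blk_make_zero(s->target, BDRV_REQ_NO_FALLBACK) == 0) {
            s->has_zero_init = true;
        }
    }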

> 
>>> (We have a use case with convert -n to freshly created image files, but
>>> my position on this on IRC was that we want the --target-is-zero flag
>>> for that anyway: Auto-detection may always break, our preferred default
>>> behavior may always change, so if you want convert -n not to touch the
>>> target image except to write non-zero data from the source, we need a
>>> --target-is-zero flag and users need to use it.  Well, management
>>> layers, because I don’t think users would use convert -n anyway.
>>>
>>> And with --target-is-zero and users effectively having to use it, I
>>> don’t think that’s a good example of a use case.)
>>
>> Yes, there will still be cases where you have to use --target-is-zero
>> because the image itself couldn't report that it already reads as
>> zeroes, but there are also enough cases where the destination is already
>> known to read zeroes and it's a shame to tell the user that 'you have to
>> add --target-is-zero to get faster copying even though we could have
>> inferred it on your behalf'.
> 
> How is it a shame?  I think only management tools would use convert -n.
>   Management tools want reliable behavior.  If you want reliable
> behavior, you have to use --target-is-zero anyway.  So I don’t see the
> actual benefit for qemu-img convert.

qemu-img convert to an NBD destination cannot create the destination, so 
it ALWAYS has to use -n.  I don't know how often users are likely to 
wire up a command line for qemu-img convert with NBD as the destination, 
or whether you are correct that it will be a management app able to 
supply -n (and thus able to supply --target-is-zero).  But at the same 
time, can a management app learn whether it is safe to supply 
--target-is-zero?  With my upcoming NBD patches, 'qemu-nbd --list' will 
show whether the NBD target is known to start life all zero, and a 
management app could use that mechanism to decide when to pass 
--target-is-zero (whether a management app would actually fork qemu-nbd 
--list, or just be an NBD client itself to do the same thing qemu-nbd 
would do, is beside the point).

Similarly, this series includes enhancements to 'qemu-img info' on qcow2 
images known to currently read as zero; again, that sort of information 
is necessary somewhere in the chain, whether it be because qemu-img 
consumes the information itself, or because the management app consumes 
the information in order to pass the --target-is-zero option to 
qemu-img, either way, the information needs to be available for consumption.

> 
>>> I suppose there is the point of blockdev-create + blockdev-mirror: This
>>> has exactly the same problem as convert -n.  But again, if you really
>>> want blockdev-mirror not just to force-zero the image, you probably need
>>> to tell it so explicitly (i.e., with a --target-is-zero flag for
>>> blockdev-mirror).
>>>
>>> (Well, I suppose we could save us a target-is-zero for mirror if we took
>>> this series and had a filter driver that force-reports BDRV_ZERO_OPEN.
>>> But, well, please no.)
>>>
>>> But maybe I’m just an idiot and there is no reason not to take this
>>> series and make blockdev-create + blockdev-mirror do the sensible thing
>>> by default in most cases. *shrug*
>>
>> My argument for taking the series _is_ that the common case can be made
>> more efficient without user effort.
> 
> The thing is, I don’t see the user effort.  I don’t think users use
> convert -n or backup manually.  And for management tools, it isn’t
> really effort to add another switch.

Maybe, but it is just shifting the burden between who consumes the 
information that an image is all zero, and how the consumption of that 
information is passed to qemu-img.

> 
>> Yes, we still need the knob for
>> when the common case isn't already smart enough,
> 
> But the user can’t know when qemu isn’t smart enough.  So users who care
> have to always give the flag.
> 
>> but the difference in
>> avoiding a pre-zeroing pass is noticeable when copying images around
> 
> I’m sure it is, but the question I ask is whether in practice we
> wouldn’t get --target-is-zero in all of these cases anyway.
> 
> 
> So I’m not sold on “it works most of the time”, because if it’s just
> most of the time, then we’ll likely see --target-is-zero all of the time.
> 
> OTOH, I suppose that with the new qcow2 extension, it would always work
> for the following case:
> (1) Create a qcow2 file,
> (2) Immediately (with the next qemu-img/QMP invocation) use it as a
> target of convert -n or mirror or anything similar.

Yes, that is one of the immediately obvious fallouts from this series - 
you can now create a preallocated qcow2 image in one process, and the 
next process using that image can readily tell that it is still 
just-created.

> 
> If so, that means it works reliably all of the time for a common case.
> I guess that’d be enough for me.
> 
> Max
> 
>> (and more than just for qcow2 - my followup series to improve NBD is
>> similarly useful given how much work has already been invested in
>> mapping NBD into storage access over https in the upper layers like ovirt).
>>
> 
> 

At any rate, I think you've convinced me to rethink how I present v2 
(maybe not by refactoring bdrv_known_zeroes usage, but instead 
refactoring bdrv_make_zero), but that the qcow2 autoclear bit is still a 
useful feature to have.
Max Reitz Feb. 6, 2020, 9:12 a.m. UTC | #14
On 05.02.20 20:21, Eric Blake wrote:
> On 2/5/20 11:04 AM, Max Reitz wrote:
>> OK, I expected users to come in a separate patch.
> 
> I can refactor that better in v2.
> 
>>
>>> That's the use case: when copying into a destination file, it's useful
>>> to know if the destination already reads as all zeroes, before
>>> attempting a fallback to bdrv_make_zero(BDRV_REQ_NO_FALLBACK) or calls
>>> to block status checking for holes.
>>
>> But that was my point on IRC.  Is it really more useful if
>> bdrv_make_zero() is just as quick?  (And the fact that NBD doesn’t have
>> an implementation looks more like a problem with NBD to me.)
> 
> That is indeed a thought - why should qemu-img even TRY to call
> bdrv_has_zero_init; it should instead call bdrv_make_zero(), which
> should be as fast as possible (see my latest reply on 9/17 exploring
> that idea more).  Under the hood, we can then make bdrv_make_zero() use
> whatever tricks it needs, whether keeping the driver's
> .bdrv_has_zero_init/_truncate callbacks, adding another callback, making
> write_zeroes faster, or whatever, but instead of making qemu-img sort
> through multiple ideas, the burden would now be hidden in the block layer.

I didn’t even think of that.  More on that at the very bottom of this mail.

>> (Considering that at least the code we discussed on IRC didn’t work for
>> preallocated images, which was the one point where we actually have a
>> problem in practice.)
> 
> And this series DOES improve the case for preallocated qcow2 images, by
> virtue of a new qcow2 autoclear bit.  But again, that may be something
> we want to hide in the driver callback interfaces, while qemu-img just
> blindly calls bdrv_make_zero() and gets success (the image now reads as
> zeroes, either because it was already that way or we did something
> quick) or failure (it is a waste of time to prezero, whether by
> write_zeroes or by trim or by truncate, so just manually write zeroes as
> part of your image copying).

Oh, yes, indeed.  Sorry.

>>>> (We have a use case with convert -n to freshly created image files, but
>>>> my position on this on IRC was that we want the --target-is-zero flag
>>>> for that anyway: Auto-detection may always break, our preferred default
>>>> behavior may always change, so if you want convert -n not to touch the
>>>> target image except to write non-zero data from the source, we need a
>>>> --target-is-zero flag and users need to use it.  Well, management
>>>> layers, because I don’t think users would use convert -n anyway.
>>>>
>>>> And with --target-is-zero and users effectively having to use it, I
>>>> don’t think that’s a good example of a use case.)
>>>
>>> Yes, there will still be cases where you have to use --target-is-zero
>>> because the image itself couldn't report that it already reads as
>>> zeroes, but there are also enough cases where the destination is already
>>> known to read zeroes and it's a shame to tell the user that 'you have to
>>> add --target-is-zero to get faster copying even though we could have
>>> inferred it on your behalf'.
>>
>> How is it a shame?  I think only management tools would use convert -n.
>>   Management tools want reliable behavior.  If you want reliable
>> behavior, you have to use --target-is-zero anyway.  So I don’t see the
>> actual benefit for qemu-img convert.
> 
> qemu-img convert to an NBD destination cannot create the destination, so
> it ALWAYS has to use -n.  I don't know how often users are likely to
> wire up a command line for qemu-img convert with NBD as the destination,
> or whether you are correct that it will be a management app able to
> supply -n (and thus able to supply --target-is-zero).  But at the same
> time, can a management app learn whether it is safe to supply
> --target-is-zero?  With my upcoming NBD patches, 'qemu-img --list' will
> show whether the NBD target is known to start life all zero, and a
> management app could use mechanism to decide when to pass
> --target-is-zero (whether a management app would actually fork qemu-img
> --list, or just be an NBD client itself to do the same thing qemu-img
> would do, is beside the point).
> 
> Similarly, this series includes enhancements to 'qemu-img info' on qcow2
> images known to currently read as zero; again, that sort of information
> is necessary somewhere in the chain, whether it be because qemu-img
> consumes the information itself, or because the management app consumes
> the information in order to pass the --target-is-zero option to
> qemu-img, either way, the information needs to be available for
> consumption.

I simply assumed that management applications will just assume that a
newly created image is zero.

I’m aware that may be wrong, but then again, that hasn’t stopped them in
the past or they would have asked for qemu to deliver this information
earlier...  (That doesn’t mean that at some point maybe they will start
to care and ask for it.)

One problem with delivering this information of course is that it’s
useless.  If qemu knows the image to be zero, then it will do the right
thing by itself, and then the management application doesn’t need to
pass --target-is-zero anymore.

>>>> I suppose there is the point of blockdev-create + blockdev-mirror: This
>>>> has exactly the same problem as convert -n.  But again, if you really
>>>> want blockdev-mirror not just to force-zero the image, you probably
>>>> need
>>>> to tell it so explicitly (i.e., with a --target-is-zero flag for
>>>> blockdev-mirror).
>>>>
>>>> (Well, I suppose we could save us a target-is-zero for mirror if we
>>>> took
>>>> this series and had a filter driver that force-reports BDRV_ZERO_OPEN.
>>>> But, well, please no.)
>>>>
>>>> But maybe I’m just an idiot and there is no reason not to take this
>>>> series and make blockdev-create + blockdev-mirror do the sensible thing
>>>> by default in most cases. *shrug*
>>>
>>> My argument for taking the series _is_ that the common case can be made
>>> more efficient without user effort.
>>
>> The thing is, I don’t see the user effort.  I don’t think users use
>> convert -n or backup manually.  And for management tools, it isn’t
>> really effort to add another switch.
> 
> Maybe, but it is just shifting the burden between who consumes the
> information that an image is all zero, and how the consumption of that
> information is passed to qemu-img.

Sure, but the question is who can take the burden the easiest.

The management layer creates the image and then uses it, so it can
easily retain this information.

When qemu creates an image and then uses it in a separate step, it
cannot retain this information.  It has to be stored somewhere
persistently and we have to fetch it.  In the case of qcow2, that works
with a flag.  In other cases...  Well, in any case it isn’t as trivial
as in a management application.

>>> Yes, we still need the knob for
>>> when the common case isn't already smart enough,
>>
>> But the user can’t know when qemu isn’t smart enough.  So users who care
>> have to always give the flag.
>>
>>> but the difference in
>>> avoiding a pre-zeroing pass is noticeable when copying images around
>>
>> I’m sure it is, but the question I ask is whether in practice we
>> wouldn’t get --target-is-zero in all of these cases anyway.
>>
>>
>> So I’m not sold on “it works most of the time”, because if it’s just
>> most of the time, then we’ll likely see --target-is-zero all of the time.
>>
>> OTOH, I suppose that with the new qcow2 extension, it would always work
>> for the following case:
>> (1) Create a qcow2 file,
>> (2) Immediately (with the next qemu-img/QMP invocation) use it as a
>> target of convert -n or mirror or anything similar.
> 
> Yes, that is one of the immediately obvious fallouts from this series -
> you can now create a preallocated qcow2 image in one process, and the
> next process using that image can readily tell that it is still
> just-created.

And it’s a common case with blockdev-create.

>> If so, that means it works reliably all of the time for a common case.
>> I guess that’d be enough for me.
>>
>> Max
>>
>>> (and more than just for qcow2 - my followup series to improve NBD is
>>> similarly useful given how much work has already been invested in
>>> mapping NBD into storage access over https in the upper layers like
>>> ovirt).
>>>
>>
>>
> 
> At any rate, I think you've convinced me to rethink how I present v2
> (maybe not by refactoring bdrv_known_zeroes usage, but instead
> refactoring bdrv_make_zero), but that the qcow2 autoclear bit is still a
> useful feature to have.

Hm.  So you mean there isn’t any caller that actually cares about
whether an image is zero, only that it ends up zero.  That is, they are all
“if (!is_zero()) { make_zero(); }” – which is functionally the same as
“make_zero();” alone.  make_zero in turn could and should be a no-op
when the image is known to be zero already.
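
(Roughly, as a hypothetical sketch rather than existing code:
bdrv_make_zero() could start with

    if (bdrv_known_zeroes(child->bs) & BDRV_ZERO_OPEN) {
        return 0; /* already reads as zeroes: nothing to do */
    }

before falling through to its existing zeroing loop.)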

I actually didn’t think of that.  Sounds good.

Max