
qemu-{img,nbd}: Don't report zeroed cluster as a hole

Message ID 20210607202204.1805199-1-nsoffer@redhat.com (mailing list archive)

Commit Message

Nir Soffer June 7, 2021, 8:22 p.m. UTC
When zeroing a cluster in an image with backing file, qemu-img and
qemu-nbd reported the area as a hole. This does not affect the guest
since the area is read as zero, but breaks code trying to reconstruct
the image chain based on qemu-img map or qemu-nbd block status response.

Here is a simpler reproducer:

    # Create a qcow2 image with a raw backing file:
    $ qemu-img create base.raw $((4*64*1024))
    $ qemu-img create -f qcow2 -b base.raw -F raw top.qcow2

    # Write to first 3 clusters of base:
    $ qemu-io -f raw -c "write -P 65 0 64k" base.raw
    $ qemu-io -f raw -c "write -P 66 64k 64k" base.raw
    $ qemu-io -f raw -c "write -P 67 128k 64k" base.raw

    # Write to second cluster of top, hiding second cluster of base:
    $ qemu-io -f qcow2 -c "write -P 69 64k 64k" top.qcow2

    # Write zeroes to third cluster of top, hiding third cluster of base:
    $ qemu-io -f qcow2 -c "write -z 128k 64k" top.qcow2

This creates:

    top:  -D0-
    base: ABC-
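The guest-visible content of this layout can be modeled with a small,
purely illustrative Python sketch (guest_read and the cluster lists are
hypothetical names, not qemu code):

```python
# Toy model of backing-file resolution for the 4-cluster layout above:
# top:  -D0-   (unallocated, data "D", explicit zero cluster, unallocated)
# base: ABC-   (data "A", "B", "C", hole in the raw file)
base = ["A", "B", "C", None]      # None: hole in the raw backing file
top = [None, "D", "ZERO", None]   # None: unallocated, falls through to base

def guest_read(cluster):
    """Resolve one cluster to the content the guest sees."""
    t = top[cluster]
    if t == "ZERO":               # qcow2 zero cluster hides the backing file
        return "\0"
    if t is not None:             # data allocated in the top layer
        return t
    b = base[cluster]             # unallocated: fall through to backing
    return b if b is not None else "\0"

print([guest_read(i) for i in range(4)])
```

Clusters 2 and 3 both read as zero for the guest, but for different
reasons (explicit zero cluster vs. hole), which is exactly the
distinction at issue below.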

How current qemu-img and qemu-nbd report the state:

    $ qemu-img map --output json top.qcow2
    [{ "start": 0, "length": 65536, "depth": 1, "zero": false, "data": true, "offset": 0},
    { "start": 65536, "length": 65536, "depth": 0, "zero": false, "data": true, "offset": 327680},
    { "start": 131072, "length": 65536, "depth": 0, "zero": true, "data": false},
    { "start": 196608, "length": 65536, "depth": 1, "zero": true, "data": false, "offset": 196608}]

    $ qemu-nbd -r -t -f qcow2 top.qcow2 &
    $ qemu-img map --output json nbd://localhost
    [{ "start": 0, "length": 131072, "depth": 0, "zero": false, "data": true, "offset": 0},
    { "start": 131072, "length": 131072, "depth": 0, "zero": true, "data": false, "offset": 131072}]

    $ nbdinfo --map nbd://localhost
             0      131072    0  allocated
        131072      131072    3  hole,zero

The third extent is reported as a hole in both cases. In qemu-nbd the
cluster is merged with the fourth cluster, which is actually a hole.

This is incorrect since if it were a hole, the third cluster would be
exposed to the guest. Programs using qemu-nbd output to reconstruct the
image chain on other storage would be confused and copy only the first 2
clusters. The result of this copy would be an image exposing the third
cluster from the base image, corrupting the guest data.

I found that it can be fixed using BDRV_BLOCK_OFFSET_VALID when
reporting the status of the extent. When we have a valid offset, we
report based on BDRV_BLOCK_DATA. Otherwise we report based on
BDRV_BLOCK_ALLOCATED.

With this fix we get:

    $ build/qemu-img map --output json top.qcow2
    [{ "start": 0, "length": 65536, "depth": 1, "zero": false, "data": true, "offset": 0},
    { "start": 65536, "length": 65536, "depth": 0, "zero": false, "data": true, "offset": 327680},
    { "start": 131072, "length": 65536, "depth": 0, "zero": true, "data": true},
    { "start": 196608, "length": 65536, "depth": 1, "zero": true, "data": false, "offset": 196608}]

    $ build/qemu-nbd -r -t -f qcow2 top.qcow2 &
    $ qemu-img map --output json nbd://localhost
    [{ "start": 0, "length": 131072, "depth": 0, "zero": false, "data": true, "offset": 0},
    { "start": 131072, "length": 65536, "depth": 0, "zero": true, "data": true, "offset": 131072},
    { "start": 196608, "length": 65536, "depth": 0, "zero": true, "data": false, "offset": 196608}]

    $ nbdinfo --map nbd://localhost
             0      131072    0  allocated
        131072       65536    2  zero
        196608       65536    3  hole,zero

The issue was found by ovirt-imageio functional tests:
https://github.com/oVirt/ovirt-imageio/blob/master/daemon/test/client_test.py

I did not update any of the existing tests, and I'm sure many tests are
missing, and the documentation should change to describe the new
behavior. Posting as is for early review.

Signed-off-by: Nir Soffer <nsoffer@redhat.com>
Resolves: https://bugzilla.redhat.com/1968693
---
 nbd/server.c | 8 ++++++--
 qemu-img.c   | 4 +++-
 2 files changed, 9 insertions(+), 3 deletions(-)

Comments

Eric Blake June 7, 2021, 9:22 p.m. UTC | #1
On Mon, Jun 07, 2021 at 11:22:04PM +0300, Nir Soffer wrote:
> When zeroing a cluster in an image with backing file, qemu-img and
> qemu-nbd reported the area as a hole. This does not affect the guest
> since the area is read as zero, but breaks code trying to reconstruct
> the image chain based on qemu-img map or qemu-nbd block status response.

Trying to reconstruct the image chain based on qemu-nbd block status
should not be attempted on just base:allocation data, but should also
take into account qemu:allocation-depth.  From the perspective of the
core NBD protocol, there is no backing file, so trying to guess what
the backing file contains without using qemu extensions is unlikely to
be correct, as shown in your example.  The fact that you could abuse
it with qemu 5.2 but it broke in 6.0 is not necessarily the sign of a
regression in 6.0, but rather could be evidence that you have been
trying to use an undocumented implementation quirk rather than a
stable interface.

> 
> Here is simpler reproducer:
> 
>     # Create a qcow2 image with a raw backing file:
>     $ qemu-img create base.raw $((4*64*1024))
>     $ qemu-img create -f qcow2 -b base.raw -F raw top.qcow2
> 
>     # Write to first 3 clusters of base:
>     $ qemu-io -f raw -c "write -P 65 0 64k" base.raw
>     $ qemu-io -f raw -c "write -P 66 64k 64k" base.raw
>     $ qemu-io -f raw -c "write -P 67 128k 64k" base.raw
> 
>     # Write to second cluster of top, hiding second cluster of base:
>     $ qemu-io -f qcow2 -c "write -P 69 64k 64k" top.qcow2
> 
>     # Write zeroes to third cluster of top, hiding third cluster of base:
>     $ qemu-io -f qcow2 -c "write -z 128k 64k" top.qcow2
> 
> This creates:
> 
>     top:  -D0-
>     base: ABC-
> 
> How current qemu-img and qemu-nbd report the state:
> 
>     $ qemu-img map --output json top.qcow2
>     [{ "start": 0, "length": 65536, "depth": 1, "zero": false, "data": true, "offset": 0},
>     { "start": 65536, "length": 65536, "depth": 0, "zero": false, "data": true, "offset": 327680},
>     { "start": 131072, "length": 65536, "depth": 0, "zero": true, "data": false},
>     { "start": 196608, "length": 65536, "depth": 1, "zero": true, "data": false, "offset": 196608}]

Note how this one reports "depth":1 when the backing file is consulted...

> 
>     $ qemu-nbd -r -t -f qcow2 top.qcow2 &
>     $ qemu-img map --output json nbd://localhost
>     [{ "start": 0, "length": 131072, "depth": 0, "zero": false, "data": true, "offset": 0},
>     { "start": 131072, "length": 131072, "depth": 0, "zero": true, "data": false, "offset": 131072}]

...but since NBD has no notion of a backing file, there is nothing
that qemu can do to report depth information itself.  If you want to
reconstruct the backing chain, you should be able to further query
qemu:allocation-depth, and piece the two queries together to get what
you need:

$ ./qemu-nbd -r -t -f qcow2 top.qcow2 -A
$ nbdinfo --map=qemu:allocation-depth nbd://localhost
         0      131072    1  local
    131072      131072    2  backing depth 2

However, _that_ output looks odd - it claims that clusters 0 and 1 are
local, and 2 and 3 come from a backing file.  Without reading code, I
would have expected something closer to the qcow2 view, claiming that
clusters 1 and 2 are local, while 0 and 3 come from a backing file (3
could also be reported as unallocated, but only if you use a qcow2 as
the backing file instead of raw, since we have no easy way to
determine which holes map to file system allocations in raw files).

/me goes to debug...  I'll need to reply in a later email when I've
spent more time on that.

[Oh, and that reminds me, I would love to patch nbdinfo to let --map
query all available contexts, not just base:allocation, without having
to explicitly name alternative --map=FOO... But it missed today's
stable release of libnbd 1.8]

[The same information can be obtained via qemu-img using
x-dirty-bitmap and --image-opts, but is so much more of a hack that
for now I will just refer to iotest 309 instead of spelling it out
here]

> 
>     $ nbdinfo --map nbd://localhost
>              0      131072    0  allocated
>         131072      131072    3  hole,zero

This faithfully reflects what qemu-img saw, which is all that the NBD
protocol lets us send without the use of extensions like
qemu:allocation-depth.

> 
> The third extents is reported as a hole in both cases. In qmeu-nbd the

qemu

> cluster is merged with forth cluster which is actually a hole.
> 
> This is incorrect since if it was a hole, the third cluster would be
> exposed to the guest. Programs using qemu-nbd output to reconstruct the
> image chain on other storage would be confused and copy only the first 2
> cluster. The results of this copy will be an image exposing the third
> cluster from the base image, corrupting the guest data.

This is where I disagree - if the NBD protocol exposed the notion of a
backing file, then reporting a local hole should indeed imply reading
from the backing file.  But since base NBD protocol does NOT support
backing images of any sort, you cannot make any assumptions about what
base:allocation says about holes, other than that the image presented
over NBD was not fully allocated in some manner at that location.  You
instead need to fix your tooling to query qemu:allocation-depth if you
are trying to reconstruct all state known by qemu.

> 
> I found that it can be fixed using BDRV_BLOCK_OFFSET_VALID when
> reporting the status of the extent. When we have a valid offset, we
> report based on BDRV_BLOCK_DATA. Otherwise we report based on
> BDRV_BLOCK_ALLOCATED.

That sounds hairy.  As an initial reaction, I'm not sure I like it.

> 
> With this fix we get:
> 
>     $ build/qemu-img map --output json top.qcow2
>     [{ "start": 0, "length": 65536, "depth": 1, "zero": false, "data": true, "offset": 0},
>     { "start": 65536, "length": 65536, "depth": 0, "zero": false, "data": true, "offset": 327680},
>     { "start": 131072, "length": 65536, "depth": 0, "zero": true, "data": true},

Earlier, this line was "data":false, which made sense - there was no
offset in either top.qcow2 nor in raw.base at which you could mmap to
read the actual zeroes that were implied by the unallocated zero
cluster.  Your change here is reporting "data":true to state that the
zeroes explicitly come from the "depth":0 layer, although it is a bit
misleading because we did not actually allocate clusters in top.qcow2
for reading the zeroes.  In reality, this really IS an instance of a
qcow2 unallocated cluster, where "data":false fits better for the
definitions in include/block/block.h.

>     { "start": 196608, "length": 65536, "depth": 1, "zero": true, "data": false, "offset": 196608}]
> 
>     $ build/qemu-nbd -r -t -f qcow2 top.qcow2 &
>     $ qemu-img map --output json nbd://localhost
>     [{ "start": 0, "length": 131072, "depth": 0, "zero": false, "data": true, "offset": 0},
>     { "start": 131072, "length": 65536, "depth": 0, "zero": true, "data": true, "offset": 131072},
>     { "start": 196608, "length": 65536, "depth": 0, "zero": true, "data": false, "offset": 196608}]

Meanwhile, this output is indeed arguably more precise, although it
differs from the qcow2 output in that you provide an offset for
cluster 2.

> 
>     $ nbdinfo --map nbd://localhost
>              0      131072    0  allocated
>         131072       65536    2  zero
>         196608       65536    3  hole,zero
> 
> The issue was found by ovirt-imageio functional tests:
> https://github.com/oVirt/ovirt-imageio/blob/master/daemon/test/client_test.py
> 
> I did not update any of the existing tests, and I'm sure many tests are
> missing, and the documentation should change to describe the new
> behavior. Posting as is for early review.
> 
> Signed-off-by: Nir Soffer <nsoffer@redhat.com>
> Resolves: https://bugzilla.redhat.com/1968693
> ---
>  nbd/server.c | 8 ++++++--
>  qemu-img.c   | 4 +++-
>  2 files changed, 9 insertions(+), 3 deletions(-)
> 
> diff --git a/nbd/server.c b/nbd/server.c
> index b60ebc3ab6..adf37905d5 100644
> --- a/nbd/server.c
> +++ b/nbd/server.c
> @@ -2127,8 +2127,12 @@ static int blockstatus_to_extents(BlockDriverState *bs, uint64_t offset,
>              return ret;
>          }
>  
> -        flags = (ret & BDRV_BLOCK_DATA ? 0 : NBD_STATE_HOLE) |
> -                (ret & BDRV_BLOCK_ZERO ? NBD_STATE_ZERO : 0);
> +        flags = (ret & BDRV_BLOCK_ZERO ? NBD_STATE_ZERO : 0);
> +
> +        if (ret & BDRV_BLOCK_OFFSET_VALID)
> +            flags |= (ret & BDRV_BLOCK_DATA ? 0 : NBD_STATE_HOLE);
> +        else
> +            flags |= (ret & BDRV_BLOCK_ALLOCATED ? 0 : NBD_STATE_HOLE);

This will fall apart on compressed or encrypted images, where data is
allocated but offset_valid is false.

>  
>          if (nbd_extent_array_add(ea, num, flags) < 0) {
>              return 0;
> diff --git a/qemu-img.c b/qemu-img.c
> index a5993682aa..6808e12d87 100644
> --- a/qemu-img.c
> +++ b/qemu-img.c
> @@ -3039,7 +3039,9 @@ static int get_block_status(BlockDriverState *bs, int64_t offset,
>      *e = (MapEntry) {
>          .start = offset,
>          .length = bytes,
> -        .data = !!(ret & BDRV_BLOCK_DATA),
> +        .data = !!(has_offset
> +            ? ret & BDRV_BLOCK_DATA
> +            : ret & BDRV_BLOCK_ALLOCATED),

I'm really not sure about this.  You are not only changing what
qemu-nbd advertises as a server, but also what qemu-img interprets as
a client.  Are you sure this will still work when you mix-and-match
old server + new client, or new server + old client?

>          .zero = !!(ret & BDRV_BLOCK_ZERO),
>          .offset = map,
>          .has_offset = has_offset,
> -- 
> 2.26.3
>

In short, I agree that the current situation is awkward, but I'm not
sure that this patch is right.  Rather, I'm wondering if we have a bug
in qemu:allocation-depth, and where once that is fixed, you should be
using that alongside base:allocation when deciding how to guess on how
to reconstruct a qcow2 backing chain using only information learned
over NBD.
Eric Blake June 7, 2021, 10:04 p.m. UTC | #2
On Mon, Jun 07, 2021 at 04:22:27PM -0500, Eric Blake wrote:

[replying to myself]

> > Here is simpler reproducer:
> > 
> >     # Create a qcow2 image with a raw backing file:
> >     $ qemu-img create base.raw $((4*64*1024))
> >     $ qemu-img create -f qcow2 -b base.raw -F raw top.qcow2
> > 
> >     # Write to first 3 clusters of base:
> >     $ qemu-io -f raw -c "write -P 65 0 64k" base.raw
> >     $ qemu-io -f raw -c "write -P 66 64k 64k" base.raw
> >     $ qemu-io -f raw -c "write -P 67 128k 64k" base.raw
> > 
> >     # Write to second cluster of top, hiding second cluster of base:
> >     $ qemu-io -f qcow2 -c "write -P 69 64k 64k" top.qcow2
> > 
> >     # Write zeroes to third cluster of top, hiding third cluster of base:
> >     $ qemu-io -f qcow2 -c "write -z 128k 64k" top.qcow2

Aha. While reproducing this locally, I typoed this as 'write -z 12k
64k', which absolutely changes the map produced...

> 
> $ ./qemu-nbd -r -t -f qcow2 top.qcow2 -A
> $ nbdinfo --map=qemu:allocation-depth nbd://localhost
>          0      131072    1  local
>     131072      131072    2  backing depth 2
> 
> However, _that_ output looks odd - it claims that clusters 0 and 1 are
> local, and 2 and 3 come from a backing file.  Without reading code, I
> would have expected something closer to the qcow2 view, claiming that
> clusters 1 and 2 are local, while 0 and 3 come from a backing file (3
> could also be reported as unallocated, but only if you use a qcow2 as
> the backing file instead of raw, since we have no easy way to
> determine which holes map to file system allocations in raw files).

and totally explains my confusion here.

> 
> /me goes to debug...  I'll need to reply in a later email when I've
> spent more time on that.
> 

After recreating the file properly, by writing the zeroes at 128k
instead of 12k, I now see:

$ nbdinfo --map=qemu:allocation-depth nbd://localhost
         0       65536    2  backing depth 2
     65536      131072    1  local
    196608       65536    2  backing depth 2

which is EXACTLY what I expected.  And sufficient for you to recreate
your backing chain:

Cluster 0 is backing depth 2 + allocated, so it comes from the backing
file; nothing to write in your replacement top.qcow2.

Cluster 1 is local + allocated, so it comes from top.qcow2 and consists
of actual data; definitely write that one.

Cluster 2 is local + hole,zero, so it reads as zero, but comes from
top.qcow2 without any allocation; when building your replacement .qcow2
file, you MUST write this cluster to match the local allocation and
override anything being inherited from the backing file, but it is up to
you whether you write it as allocated zeroes or as an unallocated but
reads-as-zero cluster.

Cluster 3 is backing depth 2 + hole,zero, which means it was read from
the backing file, and you can safely omit it from your replacement
top.qcow2.
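This per-cluster decision can be sketched as follows (classify and its
inputs are illustrative, not a real libnbd API; the tuples encode the
two corrected maps above):

```python
# Combine qemu:allocation-depth ("local" vs "backing") with
# base:allocation ("hole" or not) to decide what a replacement top image
# must contain.
def classify(depth_local, is_hole):
    if depth_local and not is_hole:
        return "write data"      # real data stored in the top layer
    if depth_local and is_hole:
        return "write zero"      # top overrides the backing file with zeroes
    return "omit"                # content comes from the backing file

# (depth_local, is_hole) for clusters 0..3 from the two maps above
clusters = [(False, False), (True, False), (True, True), (False, True)]
print([classify(*c) for c in clusters])
```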

> In short, I agree that the current situation is awkward, but I'm not
> sure that this patch is right.  Rather, I'm wondering if we have a bug
> in qemu:allocation-depth, and where once that is fixed, you should be
> using that alongside base:allocation when deciding how to guess on how
> to reconstruct a qcow2 backing chain using only information learned
> over NBD.

And since the problem was in my command line transcription skills, and
not in qemu proper, I don't think we want this patch; rather, I feel
we are better served if you could fix your downstream tooling to start
using qemu:allocation-depth if you are trying to recreate which
portions of a qcow2 file MUST be written in order for that qcow2 file
backed by a different image to provide the same data as seen over NBD.
Nir Soffer June 8, 2021, 4:38 p.m. UTC | #3
On Tue, Jun 8, 2021 at 12:22 AM Eric Blake <eblake@redhat.com> wrote:
>
> On Mon, Jun 07, 2021 at 11:22:04PM +0300, Nir Soffer wrote:
> > When zeroing a cluster in an image with backing file, qemu-img and
> > qemu-nbd reported the area as a hole. This does not affect the guest
> > since the area is read as zero, but breaks code trying to reconstruct
> > the image chain based on qemu-img map or qemu-nbd block status response.
>
> Trying to reconstruct the image chain based on qemu-nbd block status
> should not be attempted on just base:allocation data, but should also
> take into account qemu:allocation-depth.

This is correct when looking at the entire chain, but when we reconstruct
image data, we copy each layer of the image *without* its backing chain.

The example I provided was not detailed enough; what we actually do is:

    qemu-nbd .. 'json:{"driver": "qcow2", "backing": null, "file":
{"driver": "file", "filename": "top.qcow2"}}'

So there is no backing chain and allocation depth is not relevant.
- Allocated areas should be reported with flags 0
- Zero areas which are not holes should be reported as NBD_STATE_ZERO
- Zero areas which are holes (not allocated in this image) should be
reported as NBD_STATE_HOLE
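A minimal sketch of these three rules (the NBD_STATE_* values match the
NBD spec's bit definitions; nbd_flags is an illustrative helper, not
qemu code):

```python
NBD_STATE_HOLE = 1  # bit 0 of the NBD block status flags
NBD_STATE_ZERO = 2  # bit 1

def nbd_flags(allocated, reads_zero):
    """Desired flags for an image served with "backing": null."""
    flags = NBD_STATE_ZERO if reads_zero else 0
    if not allocated:            # not allocated in this image: a hole
        flags |= NBD_STATE_HOLE
    return flags

assert nbd_flags(allocated=True, reads_zero=False) == 0              # data
assert nbd_flags(allocated=True, reads_zero=True) == NBD_STATE_ZERO  # zero, not hole
assert nbd_flags(allocated=False, reads_zero=True) == 3              # hole,zero
```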

> From the perspective of the
> core NBD protocol, there is no backing file, so trying to guess what
> the backing file contains without using qemu extensions is unlikely to
> be correct, as shown in your example.  The fact that you could abuse
> it with qemu 5.2 but it broke in 6.0

I'm not abusing anything, I'm only using public APIs. qemu-nbd behavior
should not change without good reason, and we did not have any good
reason to change the behavior for qcow2 images.

>  is not necessarily the sign of a
> regression in 6.0, but rather could be evidence that you have been
> trying to use an undocumented implementation quirk rather than a
> stable interface.

I'm pretty convinced that this is a regression in qemu-nbd 6.0 since I created
this regression :-)

Since we started using qemu-nbd in 2018, qemu-nbd has always reported
holes in qcow2 images, but not in raw files. We discussed this several times,
and you explained that we have allocation information from qcow2, but not
from raw format.

My attempt to fix hole reporting in raw images has failed; reporting holes in
raw images is nice to have, but it broke the behavior of qemu-nbd with qcow2
images, which is a critical issue for ovirt.

The code using this was tested and released 3-4 months ago. This was added
to support backup vendors using snapshot-based backup, so they can move
to the NBD-based pipeline, which is safer than the old way of uploading
qcow2 images directly to storage.

If I revert:

commit 0da9856851dcca09222a1467e16ddd05dc66e460
Author: Nir Soffer <nirsof@gmail.com>
Date:   Fri Feb 19 18:07:52 2021 +0200

    nbd: server: Report holes for raw images

qemu-nbd reports zeroed areas in a useful way like it always did:

$ ./qemu-nbd -r -t 'json:{"driver": "qcow2", "backing": null, "file":
{"driver": "file", "filename": "top.qcow2"}}' &

$ nbdinfo --map nbd://localhost
         0       65536    3  hole,zero
     65536       65536    0  allocated
    131072       65536    2  zero
    196608       65536    3  hole,zero

There is no need to use allocation depth info; base:allocation works fine
for this use case, and the output makes sense.

> > Here is simpler reproducer:
> >
> >     # Create a qcow2 image with a raw backing file:
> >     $ qemu-img create base.raw $((4*64*1024))
> >     $ qemu-img create -f qcow2 -b base.raw -F raw top.qcow2
> >
> >     # Write to first 3 clusters of base:
> >     $ qemu-io -f raw -c "write -P 65 0 64k" base.raw
> >     $ qemu-io -f raw -c "write -P 66 64k 64k" base.raw
> >     $ qemu-io -f raw -c "write -P 67 128k 64k" base.raw
> >
> >     # Write to second cluster of top, hiding second cluster of base:
> >     $ qemu-io -f qcow2 -c "write -P 69 64k 64k" top.qcow2
> >
> >     # Write zeroes to third cluster of top, hiding third cluster of base:
> >     $ qemu-io -f qcow2 -c "write -z 128k 64k" top.qcow2
> >
> > This creates:
> >
> >     top:  -D0-
> >     base: ABC-
> >
> > How current qemu-img and qemu-nbd report the state:
> >
> >     $ qemu-img map --output json top.qcow2
> >     [{ "start": 0, "length": 65536, "depth": 1, "zero": false, "data": true, "offset": 0},
> >     { "start": 65536, "length": 65536, "depth": 0, "zero": false, "data": true, "offset": 327680},
> >     { "start": 131072, "length": 65536, "depth": 0, "zero": true, "data": false},
> >     { "start": 196608, "length": 65536, "depth": 1, "zero": true, "data": false, "offset": 196608}]
>
> Note how this one reports "depth":1 when the backing file is consulted...

Yes, qemu-img includes enough info to tell about the status of the cluster,
so we can keep it as is.

> >     $ qemu-nbd -r -t -f qcow2 top.qcow2 &
> >     $ qemu-img map --output json nbd://localhost
> >     [{ "start": 0, "length": 131072, "depth": 0, "zero": false, "data": true, "offset": 0},
> >     { "start": 131072, "length": 131072, "depth": 0, "zero": true, "data": false, "offset": 131072}]
>
> ...but since NBD has no notion of a backing file, there is nothing
> that qemu can do to report depth information itself.  If you want to
> reconstruct the backing chain, you should be able to further query
> qemu:allocation-depth, and piece the two queries together to get what
> you need:
>
> $ ./qemu-nbd -r -t -f qcow2 top.qcow2 -A

What you suggest is:

1. Get base:allocation - where the cluster is wrongly reported as a hole
2. Get also qemu:allocation-depth
3. Merge both lists of extents, splitting and extending extents as needed,
   since there is no guarantee that we get the same lists describing the
   same extents
4. Finally, get a correct base:allocation showing that the zero range is
   not a hole

This is a lot of unneeded and complicated work to work around a bug in
qemu, and it would have to be done in every client. I don't find this an
attractive solution.
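Step 3, the only nontrivial one, can be sketched like this
(merge_extents and the hard-coded lists are illustrative; real clients
would obtain them via NBD block status):

```python
# Merge two extent lists, splitting at every boundary of either input.
# Each input is a list of (length, flags) runs covering the same range.
def merge_extents(a, b):
    a, b = list(a), list(b)
    out = []
    while a and b:
        (la, fa), (lb, fb) = a[0], b[0]
        n = min(la, lb)              # advance by the shorter remaining run
        out.append((n, fa, fb))
        a[0], b[0] = (la - n, fa), (lb - n, fb)
        if a[0][0] == 0:
            a.pop(0)
        if b[0][0] == 0:
            b.pop(0)
    return out

base_alloc = [(131072, 0), (131072, 3)]              # allocated; hole,zero
alloc_depth = [(65536, 2), (131072, 1), (65536, 2)]  # backing; local; backing
print(merge_extents(base_alloc, alloc_depth))
```

The merged list yields one entry per cluster, each carrying both the
base:allocation flags and the allocation depth.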

> > cluster is merged with forth cluster which is actually a hole.
> >
> > This is incorrect since if it was a hole, the third cluster would be
> > exposed to the guest. Programs using qemu-nbd output to reconstruct the
> > image chain on other storage would be confused and copy only the first 2
> > cluster. The results of this copy will be an image exposing the third
> > cluster from the base image, corrupting the guest data.
>
> This is where I disagree - if the NBD protocol exposed the notion of a
> backing file, then reporting a local hole should indeed imply reading
> from the backing file.  But since base NBD protocol does NOT support
> backing images of any sort, you cannot make any assumptions about what
> base:allocation says about holes, other than that the image presented
> over NBD was not fully allocated in some manner at that location.  You
> instead need to fix your tooling to query qemu:allocation-depth if you
> are trying to reconstruct all state known by qemu.

When we look at the entire chain, you are correct, but when we disable
the backing file and have no backing chain, the NBD protocol hole bit is
the right way to describe an unallocated area in a qcow2 image.

> > I found that it can be fixed using BDRV_BLOCK_OFFSET_VALID when
> > reporting the status of the extent. When we have a valid offset, we
> > report based on BDRV_BLOCK_DATA. Otherwise we report based on
> > BDRV_BLOCK_ALLOCATED.
>
> That sounds hairy.  As an initial reaction, I'm not sure I like it.
>
> >
> > With this fix we get:
> >
> >     $ build/qemu-img map --output json top.qcow2
> >     [{ "start": 0, "length": 65536, "depth": 1, "zero": false, "data": true, "offset": 0},
> >     { "start": 65536, "length": 65536, "depth": 0, "zero": false, "data": true, "offset": 327680},
> >     { "start": 131072, "length": 65536, "depth": 0, "zero": true, "data": true},
>
> Earlier, this line was "data":false, which made sense - there was no
> offset in either top.qcow2 nor in raw.base at which you could mmap to
> read the actual zeroes that were implied by the unallocated zero
> cluster.  Your change here is reporting "data":true to state that the
> zeroes explicitly come from the "depth":0 layer, although it is a bit
> misleading because we did not actually allocate clusters in top.qcow2
> for reading the zeroes.  In reality, this really IS an instance of a
> qcow2 unallocated cluster, where "data":false fits better for the
> definitions in include/block/block.h.

Well, this cluster is not a hole. How do you describe "not a hole" in
qemu-img map? data=true seems to be the only way.

But again, we can leave qemu-img map as is, since we have the depth field.

> >     { "start": 196608, "length": 65536, "depth": 1, "zero": true, "data": false, "offset": 196608}]
> >
> >     $ build/qemu-nbd -r -t -f qcow2 top.qcow2 &
> >     $ qemu-img map --output json nbd://localhost
> >     [{ "start": 0, "length": 131072, "depth": 0, "zero": false, "data": true, "offset": 0},
> >     { "start": 131072, "length": 65536, "depth": 0, "zero": true, "data": true, "offset": 131072},
> >     { "start": 196608, "length": 65536, "depth": 0, "zero": true, "data": false, "offset": 196608}]
>
> Meanwhile, this output is indeed arguably more precise, although it
> differs from the qcow2 output in that you provide an offset for
> cluster 2.

I guess qemu-img has less information when using qemu-nbd, since the
image is presented as a raw image.

> >     $ nbdinfo --map nbd://localhost
> >              0      131072    0  allocated
> >         131072       65536    2  zero
> >         196608       65536    3  hole,zero
> >
> > The issue was found by ovirt-imageio functional tests:
> > https://github.com/oVirt/ovirt-imageio/blob/master/daemon/test/client_test.py
> >
> > I did not update any of the existing tests, and I'm sure many tests are
> > missing, and the documentation should change to describe the new
> > behavior. Posting as is for early review.
> >
> > Signed-off-by: Nir Soffer <nsoffer@redhat.com>
> > Resolves: https://bugzilla.redhat.com/1968693
> > ---
> >  nbd/server.c | 8 ++++++--
> >  qemu-img.c   | 4 +++-
> >  2 files changed, 9 insertions(+), 3 deletions(-)
> >
> > diff --git a/nbd/server.c b/nbd/server.c
> > index b60ebc3ab6..adf37905d5 100644
> > --- a/nbd/server.c
> > +++ b/nbd/server.c
> > @@ -2127,8 +2127,12 @@ static int blockstatus_to_extents(BlockDriverState *bs, uint64_t offset,
> >              return ret;
> >          }
> >
> > -        flags = (ret & BDRV_BLOCK_DATA ? 0 : NBD_STATE_HOLE) |
> > -                (ret & BDRV_BLOCK_ZERO ? NBD_STATE_ZERO : 0);
> > +        flags = (ret & BDRV_BLOCK_ZERO ? NBD_STATE_ZERO : 0);
> > +
> > +        if (ret & BDRV_BLOCK_OFFSET_VALID)
> > +            flags |= (ret & BDRV_BLOCK_DATA ? 0 : NBD_STATE_HOLE);
> > +        else
> > +            flags |= (ret & BDRV_BLOCK_ALLOCATED ? 0 : NBD_STATE_HOLE);
>
> This will fall apart on compressed or encrypted images, where data is
> allocated but offset_valid is false.

So this is not the right way to trigger the use of BDRV_BLOCK_DATA.

> >          if (nbd_extent_array_add(ea, num, flags) < 0) {
> >              return 0;
> > diff --git a/qemu-img.c b/qemu-img.c
> > index a5993682aa..6808e12d87 100644
> > --- a/qemu-img.c
> > +++ b/qemu-img.c
> > @@ -3039,7 +3039,9 @@ static int get_block_status(BlockDriverState *bs, int64_t offset,
> >      *e = (MapEntry) {
> >          .start = offset,
> >          .length = bytes,
> > -        .data = !!(ret & BDRV_BLOCK_DATA),
> > +        .data = !!(has_offset
> > +            ? ret & BDRV_BLOCK_DATA
> > +            : ret & BDRV_BLOCK_ALLOCATED),
>
> I'm really not sure about this.  You are not only changing what
> qemu-nbd advertises as a server, but also what qemu-img interprets as
> a client.  Are you sure this will still work when you mix-and-match
> old server + new client, or new server + old client?

I did not test these combinations, but let's simplify by removing this change
for now.

> >          .zero = !!(ret & BDRV_BLOCK_ZERO),
> >          .offset = map,
> >          .has_offset = has_offset,
> > --
> > 2.26.3
> >

Since the original change introduced a regression breaking existing users,
we don't have good enough tests in this area, and it is not clear how this
can be fixed properly, I suggest reverting the original change for 6.0.
Then we have time to work on a proper fix upstream.

What do you think?

Nir
Eric Blake June 8, 2021, 6:45 p.m. UTC | #4
On Tue, Jun 08, 2021 at 07:38:10PM +0300, Nir Soffer wrote:
> On Tue, Jun 8, 2021 at 12:22 AM Eric Blake <eblake@redhat.com> wrote:
> >
> > On Mon, Jun 07, 2021 at 11:22:04PM +0300, Nir Soffer wrote:
> > > When zeroing a cluster in an image with backing file, qemu-img and
> > > qemu-nbd reported the area as a hole. This does not affect the guest
> > > since the area is read as zero, but breaks code trying to reconstruct
> > > the image chain based on qemu-img map or qemu-nbd block status response.
> >
> > Trying to reconstruct the image chain based on qemu-nbd block status
> > should not be attempted on just base:allocation data, but should also
> > take into account qemu:allocation-depth.
> 
> This is correct when looking at the entire chain, but when we reconstruct
> image data, we copy each image in the layer *without* the backing chain.
> 
> The example I provided was not detailed enough, what we actually do is:
> 
>     qemu-nbd .. 'json:{"driver": "qcow2", "backing": null, "file":
> {"driver": "file", "filename": "top.qcow2"}}'
> 
> So there is no backing chain and allocation depth is not relevant.
> - Allocated areas should be reported with flags 0
> - Zero areas which are not holes should be reported as NBD_STATE_ZERO
> - Zero areas which are holes (not allocated in this image) should be
> reported as NBD_STATE_HOLE

Again, what you WANT is qemu:allocation-depth.

$ ./qemu-nbd -r -t -f qcow2 -A 'json:{"driver":"qcow2", "backing":null, \
  "file":{"driver":"file", "filename":"top.qcow2"}}'
$ nbdinfo --map=qemu:allocation-depth nbd://localhost
         0       65536    0  unallocated
     65536      131072    1  local
    196608       65536    0  unallocated

$ nbdinfo --map nbd://localhost
         0       65536    3  hole,zero
     65536       65536    0  allocated
    131072      131072    3  hole,zero

You don't care whether the information reads as zero or not, but
whether top.qcow2 is responsible for the data at that cluster.
base:allocation does not answer that question.  But
qemu:allocation-depth answers it perfectly.

> 
> > From the perspective of the
> > core NBD protocol, there is no backing file, so trying to guess what
> > the backing file contains without using qemu extensions is unlikely to
> > be correct, as shown in your example.  The fact that you could abuse
> > it with qemu 5.2 but it broke in 6.0
> 
> I'm not abusing anything, I'm only using public APIs. qemu-nbd behavior
> should not change without good reason, and we did not have any good
> reason to change the behavior for qcow2 images.

Ah, but we did.  Exposing BDRV_BLOCK_ALLOCATED as server, but
consuming it as BDRV_BLOCK_DATA as client, was inconsistent.  It was a
bug that we ever used BLOCK_ALLOCATED in the first place, when it has
_always_ been that the NBD semantics were supposed to be modeled on
our definition of BLOCK_DATA.  That it took us a couple of years to
notice our bug is unfortunate, but we DO have a good reason for the
change - we were fixing an actual bug where we were reporting
incorrect information compared to what the NBD spec was documenting.

> 
> >  is not necessarily the sign of a
> > regression in 6.0, but rather could be evidence that you have been
> > trying to use an undocumented implementation quirk rather than a
> > stable interface.
> 
> I'm pretty convinced that this is a regression in qemu-nbd 6.0 since I created
> this regression :-)

I understand that you were surprised by the ramifications of your
patch causing more changes than what you expected, but I still argue
that your patch was correct and that the decision to incorporate it
was intentional because it was the right thing to do.  Papering over
the fallout for the sake of clients that should be using
qemu:allocation-depth instead does not seem like it is worth the
maintenance nightmare to me.

> 
> Since we started using qemu-nbd in 2018, qemu-nbd has always reported
> holes in qcow2 images, but not in raw files. We discussed this several times,
> and you explained that we have allocation information from qcow2, but not
> from raw format.
> 
> My attempt to fix hole reporting in raw images has failed; reporting holes in
> raw images is nice to have, but it broke the behavior of qemu-nbd with qcow2
> images, which is a critical issue for ovirt.

Rather, ovirt had been relying on buggy behavior, and now that the bug
has been fixed, we are scrambling to figure out how to make ovirt
still play nicely.  But my answer to that is to use
qemu:allocation-depth.  It was introduced in 5.2, so it predates the
point where base:allocation behavior was fixed, and it provides the
answer to the question you are really asking (which parts of my image
came from the image directly, rather than a backing file), rather than
merely an indirect answer (how can I abuse the determination of which
parts of the image are allocated or sparse to imply that those same
portions must come from a backing image).  There is nothing
semantically wrong with a sparse cluster in the top layer overriding a
non-sparse cluster from the backing file.

> 
> The code using this was tested and released 3-4 months ago. This was added
> to support backup vendors using snapshot-based backup, so they can move
> to the NBD-based pipeline, which is safer than the old way of uploading
> qcow2 images directly to storage.

Downstream can play games with backporting or reverting patches as
needed to make the entire ecosystem play nicely, but upstream, I don't
think it is right to take a patch that re-introduces a bug in
base:allocation merely because a client was unwilling to use the
correct interface of qemu:allocation-depth to get the answer to the
question they really meant to ask.

> 
> If I revert:
> 
> commit 0da9856851dcca09222a1467e16ddd05dc66e460
> Author: Nir Soffer <nirsof@gmail.com>
> Date:   Fri Feb 19 18:07:52 2021 +0200
> 
>     nbd: server: Report holes for raw images
> 
> qemu-nbd reports zeroed areas in a useful way like it always did:
> 
> $ ./qemu-nbd -r -t 'json:{"driver": "qcow2", "backing": null, "file":
> {"driver": "file", "filename": "top.qcow2"}}' &
> 
> $ nbdinfo --map nbd://localhost
>          0       65536    3  hole,zero
>      65536       65536    0  allocated
>     131072       65536    2  zero
>     196608       65536    3  hole,zero
> 
> There is no need to use allocation depth info, the base:allocation works fine
> for this use case, and the output makes sense.

But you have to remember that the NBD specification for
base:allocation is that flags are OPTIONAL at all times.  You are at
the mercy of the server for which flags it sets; it is ALWAYS valid
for a server to reply with status 0 for any cluster; any server is
free to return 0 in the future where it returned 1 today (perhaps
because the server changed its mind on how expensive it was to compute
when it is safe to set the bit to 1), or to return 1 in the future
where it returned 0 today (perhaps because the server gained insight
into how to quickly report something that happens to be true about
that cluster).  Basing decisions on what was returned as a 1 bit in
NBD_CMD_BLOCK_STATUS is okay if those decisions are solely for an
optimization purpose (because we KNOW this portion of this disk will
read as zero, we don't have to read it), but wrong if they are for an
accuracy purpose (assuming that only clusters where hole is clear is a
1:1 correspondence to which clusters override a backing image).  And
this is a case where qcow2 happens to quickly know that a sparse zero
cluster exists, so reporting hole,zero IS more accurate than mere
zero.  The fact that we failed to omit hole in the past was not a
violation of the NBD spec, but rather a bug in our code for looking at
the wrong bit of information.  But the mere presence of the hole bit
is NOT sufficient knowledge to know that the NBD server is serving a
qcow2 image that has a non-allocated zero cluster overriding the
backing image, because we can't even guarantee the server is visiting
a qcow2 file.

> > ...but since NBD has no notion of a backing file, there is nothing
> > that qemu can do to report depth information itself.  If you want to
> > reconstruct the backing chain, you should be able to further query
> > qemu:allocation-depth, and piece the two queries together to get what
> > you need:
> >
> > $ ./qemu-nbd -r -t -f qcow2 top.qcow2 -A
> 
> What you suggest is:
> 
> 1. Get base:allocation - where the cluster is wrongly reported as a hole

The report is not wrong.  That portion of the qcow2 file IS a hole, in
that it is not pre-allocated and therefore has no on-disk storage
reserved.  It will still read as all-zeroes rather than deferring to
the backing image, but calling it a hole is not wrong.

> 2. Also get qemu:allocation-depth
> 3. Merge both lists of extents, splitting and extending extents as needed,
>    since there is no guarantee that the two lists describe the same extents
> 4. Finally get correct base:allocation showing the zero range is not a hole
> 
> This is a lot of unneeded and complicated work to cover for a bug in qemu,
> and it would have to be done in every client. I don't find this an attractive
> solution.

Unfortunately, it is the only viable long-term solution.  I would love
to make libnbd handle the reconstruction of a unified view of multiple
NBD metadata contexts in an easier manner, so that upper-layer
applications could simply use libnbd to get at that work without
having to repeat the coding of reconstructing things themselves.  But
I don't see any way around the fact that from a correctness
perspective, this really IS the only way to get at the information you
need (not only a knowledge of which portions of the disk are holes,
but also which portions of the disk override a backing image).
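To make the reconstruction concrete: merging the two metadata contexts amounts to splitting both extent lists at every boundary and pairing the values. Here is a rough Python sketch, using simplified `(start, length, value)` tuples rather than any real libnbd API (the helper name and tuple shape are illustrative, not ovirt's actual code):

```python
def merge_extents(alloc, depth):
    """Merge base:allocation extents (start, length, flags) with
    qemu:allocation-depth extents (start, length, depth).  The two
    lists need not use the same extent boundaries, so split at every
    boundary from either list and yield (start, length, flags, depth)."""
    # Collect every boundary that appears in either list.
    bounds = sorted({s for s, l, _ in alloc} |
                    {s + l for s, l, _ in alloc} |
                    {s for s, l, _ in depth} |
                    {s + l for s, l, _ in depth})

    def lookup(extents, offset):
        # Linear scan; fine for the handful of extents in these examples.
        for s, l, v in extents:
            if s <= offset < s + l:
                return v
        raise ValueError("offset not covered by extent list")

    for start, end in zip(bounds, bounds[1:]):
        yield (start, end - start,
               lookup(alloc, start), lookup(depth, start))

# Maps from the thread: base:allocation flags and qemu:allocation-depth.
alloc = [(0, 65536, 3), (65536, 65536, 0), (131072, 131072, 3)]
depth = [(0, 65536, 0), (65536, 131072, 1), (196608, 65536, 0)]

merged = list(merge_extents(alloc, depth))
# The 131072..196608 extent comes out as hole,zero *and* depth 1 (local):
# a zeroed cluster in top.qcow2, unlike 196608..262144, which is
# hole,zero at depth 0 - a true hole deferring to the backing file.
```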

> 
> > > cluster is merged with the fourth cluster, which is actually a hole.
> > >
> > > This is incorrect since, if it were a hole, the third cluster of the base
> > > image would be exposed to the guest. Programs using qemu-nbd output to
> > > reconstruct the image chain on other storage would be confused and copy
> > > only the first 2 clusters. The result of this copy would be an image
> > > exposing the third cluster from the base image, corrupting the guest data.
> >
> > This is where I disagree - if the NBD protocol exposed the notion of a
> > backing file, then reporting a local hole should indeed imply reading
> > from the backing file.  But since base NBD protocol does NOT support
> > backing images of any sort, you cannot make any assumptions about what
> > base:allocation says about holes, other than that the image presented
> > over NBD was not fully allocated in some manner at that location.  You
> > instead need to fix your tooling to query qemu:allocation-depth if you
> > are trying to reconstruct all state known by qemu.
> 
> When we look at the entire chain, you are correct, but when we disable
> the backing file and have no backing chain, NBD protocol hole bit is
> the right way to describe an unallocated area in a qcow2 image.

Your problem is that qcow2 images have two ways to represent a zero
cluster.  One is a pre-allocated cluster (there is host storage
reserved; the host storage might or might not read as zero, but the
qcow2 code will return zero, regardless of whether there is a backing
file); the other is an unallocated cluster (there is no host storage
reserved, but the qcow2 code still reads zero regardless of whether
there is a backing file).  One of those is allocated, the other is
sparse.  And NBD_STATE_HOLE is a great fit for telling those two types
of clusters apart.

In short, qemu has 3 bits of interesting information per cluster:
whether it reads as zero, whether it has allocated space, and whether
it overrides a backing file.  The NBD protocol base:allocation only
provides two of those bits of information (BDRV_BLOCK_DATA and
BDRV_BLOCK_ZERO); to get at the third (BDRV_BLOCK_ALLOCATED), you HAVE
to use a qemu-specific extension.

That said, we CAN provide a _new_ qemu:block-status metacontext, which
returns a bitmap of bit 0 (NBD_STATE_HOLE), bit 1 (NBD_STATE_ZERO),
and bit 2 (BDRV_BLOCK_ALLOCATED) all from a single context, at which
point you don't have to reconstruct the combination of base:allocation
(two bits) and qemu:allocation-depth (an arbitrary integer) across
possibly differing reporting lengths yourself.

I'll post a counter-proposal patch along those lines, shortly.
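Assuming the bit layout sketched above (bit 0 = hole, bit 1 = zero, bit 2 = allocated; this is only the proposal as described in this thread, not a released interface), a client rebuilding a single layer could classify each extent in one pass:

```python
STATE_HOLE = 1 << 0       # bit 0: no storage reserved at this layer
STATE_ZERO = 1 << 1       # bit 1: reads as zero
STATE_ALLOCATED = 1 << 2  # bit 2: this layer provides the data

def classify(flags):
    """Decide how to handle one extent of the hypothetical
    qemu:block-status context when copying a single layer."""
    if not flags & STATE_ALLOCATED:
        return "unallocated"   # defers to the backing file: skip it
    if flags & STATE_ZERO:
        return "zero"          # owned by this layer: write zeroes
    return "data"              # owned by this layer: copy the bytes
```

With this, the zeroed cluster (hole + zero + allocated) and the true hole (hole + zero, not allocated) are distinguishable from a single context, with no merging required.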

> 
> > > I found that it can be fixed using BDRV_BLOCK_OFFSET_VALID when
> > > reporting the status of the extent. When we have a valid offset, we
> > > report based on BDRV_BLOCK_DATA. Otherwise we report based on
> > > BDRV_BLOCK_ALLOCATED.
> >
> > That sounds hairy.  As an initial reaction, I'm not sure I like it.
> >
> > >
> > > With this fix we get:
> > >
> > >     $ build/qemu-img map --output json top.qcow2
> > >     [{ "start": 0, "length": 65536, "depth": 1, "zero": false, "data": true, "offset": 0},
> > >     { "start": 65536, "length": 65536, "depth": 0, "zero": false, "data": true, "offset": 327680},
> > >     { "start": 131072, "length": 65536, "depth": 0, "zero": true, "data": true},
> >
> > Earlier, this line was "data":false, which made sense - there was no
> > offset in either top.qcow2 nor in raw.base at which you could mmap to
> > read the actual zeroes that were implied by the unallocated zero
> > cluster.  Your change here is reporting "data":true to state that the
> > zeroes explicitly come from the "depth":0 layer, although it is a bit
> > misleading because we did not actually allocate clusters in top.qcow2
> > for reading the zeroes.  In reality, this really IS an instance of a
> > qcow2 unallocated cluster, where "data":false fits better for the
> > definitions in include/block/block.h.
> 
> Well, this cluster is not a hole. How do you describe "not a hole" in
> qemu-img map?

Ah, but the cluster _is_ a hole.  There is no storage space occupied
in the qcow2 file corresponding to that slice of the file.  So from
the qcow2's perspective, that region of guest-visible address space
comes from a hole, where the guest writing to it will require an
allocation of additional host storage space.

"not a hole" in qcow2 corresponds to any pre-allocated cluster, and
shows up in qemu-img map as "data":true.

> data=true seem that be the only way.
> 
> But again we can leave qemu-img map as is since we have the depth argument.

And since the depth parameter answered your question with direct
qcow2, it should also answer your question over NBD with
qemu:allocation-depth.  In other words, you are using 3 bits of
information to reconstruct your qcow2 file, so you need to ask for all
3 pieces of information over NBD.

> 
> > >     { "start": 196608, "length": 65536, "depth": 1, "zero": true, "data": false, "offset": 196608}]
> > >
> > >     $ build/qemu-nbd -r -t -f qcow2 top.qcow2 &
> > >     $ qemu-img map --output json nbd://localhost
> > >     [{ "start": 0, "length": 131072, "depth": 0, "zero": false, "data": true, "offset": 0},
> > >     { "start": 131072, "length": 65536, "depth": 0, "zero": true, "data": true, "offset": 131072},
> > >     { "start": 196608, "length": 65536, "depth": 0, "zero": true, "data": false, "offset": 196608}]
> >
> > Meanwhile, this output is indeed arguably more precise, although it
> > differs from the qcow2 output in that you provide an offset for
> > cluster 2.
> 
> I guess qemu-img has less information when going through qemu-nbd, since
> the image is presented as a raw image.
> 
...
> > > -        flags = (ret & BDRV_BLOCK_DATA ? 0 : NBD_STATE_HOLE) |
> > > -                (ret & BDRV_BLOCK_ZERO ? NBD_STATE_ZERO : 0);
> > > +        flags = (ret & BDRV_BLOCK_ZERO ? NBD_STATE_ZERO : 0);
> > > +
> > > +        if (ret & BDRV_BLOCK_OFFSET_VALID)
> > > +            flags |= (ret & BDRV_BLOCK_DATA ? 0 : NBD_STATE_HOLE);
> > > +        else
> > > +            flags |= (ret & BDRV_BLOCK_ALLOCATED ? 0 : NBD_STATE_HOLE);
> >
> > This will fall apart on compressed or encrypted images, where data is
> > allocated but offset_valid is false.
> 
> So this is not the right way to trigger the use of BDRV_BLOCK_DATA.

The intent, all along, was that NBD's STATE_HOLE was identical in
semantics to !BDRV_BLOCK_DATA (the inverted value was chosen because
having a default of all 0s as the safe point was easier to document in
the NBD spec).  Vladimir and I had a bug when first implementing that
intent, but your patch fixed that bug.  Unfortunately, as you have
pointed out, that fix has now exposed clients that were abusing the
bug to infer information about an NBD server that was not actually
viable to be inferred (whether a portion of an image is a hole is NOT
a reliable indicator of whether that portion of the image overrides a
backing file, but it wasn't until we started accurately reporting
HOLEs in more contexts that it actually tripped you up).

> 
> > >          if (nbd_extent_array_add(ea, num, flags) < 0) {
> > >              return 0;
> > > diff --git a/qemu-img.c b/qemu-img.c
> > > index a5993682aa..6808e12d87 100644
> > > --- a/qemu-img.c
> > > +++ b/qemu-img.c
> > > @@ -3039,7 +3039,9 @@ static int get_block_status(BlockDriverState *bs, int64_t offset,
> > >      *e = (MapEntry) {
> > >          .start = offset,
> > >          .length = bytes,
> > > -        .data = !!(ret & BDRV_BLOCK_DATA),
> > > +        .data = !!(has_offset
> > > +            ? ret & BDRV_BLOCK_DATA
> > > +            : ret & BDRV_BLOCK_ALLOCATED),
> >
> > I'm really not sure about this.  You are not only changing what
> > qemu-nbd advertises as a server, but also what qemu-img interprets as
> > a client.  Are you sure this will still work when you mix-and-match
> > old server + new client, or new server + old client?
> 
> I did not test these combinations, but let's simplify by removing this change
> for now.
> 
> > >          .zero = !!(ret & BDRV_BLOCK_ZERO),
> > >          .offset = map,
> > >          .has_offset = has_offset,
> > > --
> > > 2.26.3
> > >
> 
> Since the original change introduced a regression breaking existing users, we
> don't have good enough tests in this area, and it is not clear how this can be
> fixed properly, I suggest reverting the original change for 6.0. That gives us
> time to work on a proper fix upstream.

6.0 is already released upstream, so you are stuck with that new
behavior.  And I think it is worse to ping-pong the behavior back to
the way it was in 5.2 for 6.1, than to just keep the 6.0 behavior
(which fixed a bug) unchanged going forward into 6.1 upstream.

Downstream, we can backport and/or revert whatever patches we want to
make the downstream ecosystem play nicely.  But that is an argument
for downstream.

> 
> What do you think?

I'll go ahead and work up my counter-proposal patch to add yet another
qemu:... context to make it easier for you to get combined status in
one go.  I don't know whether it will be attractive enough to be worth
incorporating upstream, but if it is, that is certainly another option
that downstream can use, because it is easier to backport a new qemu:
context than it is to play games with "which behavior will
base:allocation have" - you are guaranteed that if qemu knows how to
advertise the new context, it provides the information you want.
Nir Soffer June 8, 2021, 8:42 p.m. UTC | #5
On Tue, Jun 8, 2021 at 9:46 PM Eric Blake <eblake@redhat.com> wrote:
>
> On Tue, Jun 08, 2021 at 07:38:10PM +0300, Nir Soffer wrote:
> > On Tue, Jun 8, 2021 at 12:22 AM Eric Blake <eblake@redhat.com> wrote:
> > >
> > > On Mon, Jun 07, 2021 at 11:22:04PM +0300, Nir Soffer wrote:
> > > > When zeroing a cluster in an image with backing file, qemu-img and
> > > > qemu-nbd reported the area as a hole. This does not affect the guest
> > > > since the area is read as zero, but breaks code trying to reconstruct
> > > > the image chain based on qemu-img map or qemu-nbd block status response.
> > >
> > > Trying to reconstruct the image chain based on qemu-nbd block status
> > > should not be attempted on just base:allocation data, but should also
> > > take into account qemu:allocation-depth.
> >
> > This is correct when looking at the entire chain, but when we reconstruct
> > image data, we copy each image in the layer *without* the backing chain.
> >
> > The example I provided was not detailed enough, what we actually do is:
> >
> >     qemu-nbd .. 'json:{"driver": "qcow2", "backing": null, "file":
> > {"driver": "file", "filename": "top.qcow2"}}'
> >
> > So there is no backing chain and allocation depth is not relevant.
> > - Allocated areas should be reported with flags 0
> > - Zero areas which are not holes should be reported as NBD_STATE_ZERO
> > - Zero areas which are holes (not allocated in this image) should be
> > reported as NBD_STATE_HOLE
>
> Again, what you WANT is qemu:allocation-depth.
>
> $ ./qemu-nbd -r -t -f qcow2 -A 'json:{"driver":"qcow2", "backing":null, \
>   "file":{"driver":"file", "filename":"top.qcow2"}}'
> $ nbdinfo --map=qemu:allocation-depth nbd://localhost
>          0       65536    0  unallocated
>      65536      131072    1  local
>     196608       65536    0  unallocated
>
> $ nbdinfo --map nbd://localhost
>          0       65536    3  hole,zero
>      65536       65536    0  allocated
>     131072      131072    3  hole,zero
>
> You don't care whether the information reads as zero or not, but
> whether top.qcow2 is responsible for the data at that cluster.
> base:allocation does not answer that question.  But
> qemu:allocation-depth answers it perfectly.
>
> >
> > > From the perspective of the
> > > core NBD protocol, there is no backing file, so trying to guess what
> > > the backing file contains without using qemu extensions is unlikely to
> > > be correct, as shown in your example.  The fact that you could abuse
> > > it with qemu 5.2 but it broke in 6.0
> >
> > I'm not abusing anything, I'm only using public APIs. qemu-nbd behavior
> > should not change without good reason, and we did not have any good
> > reason to change the behavior for qcow2 images.
>
> Ah, but we did.  Exposing BDRV_BLOCK_ALLOCATED as server, but
> consuming it as BDRV_BLOCK_DATA as client, was inconsistent.  It was a
> bug that we ever used BLOCK_ALLOCATED in the first place, when it has
> _always_ been that the NBD semantics were supposed to be modeled on
> our definition of BLOCK_DATA.  That it took us a couple of years to
> notice our bug is unfortunate, but we DO have a good reason for the
> change - we were fixing an actual bug where we were reporting
> incorrect information compared to what the NBD spec was documenting.
>
> >
> > >  is not necessarily the sign of a
> > > regression in 6.0, but rather could be evidence that you have been
> > > trying to use an undocumented implementation quirk rather than a
> > > stable interface.
> >
> > I'm pretty convinced that this is a regression in qemu-nbd 6.0 since I created
> > this regression :-)
>
> I understand that you were surprised by the ramifications of your
> patch causing more changes than what you expected, but I still argue
> that your patch was correct and that the decision to incorporate it
> was intentional because it was the right thing to do.  Papering over
> the fallout for the sake of clients that should be using
> qemu:allocation-depth instead does not seem like it is worth the
> maintenance nightmare to me.
>
> >
> > Since we started using qemu-nbd in 2018, qemu-nbd has always reported
> > holes in qcow2 images, but not in raw files. We discussed this several times,
> > and you explained that we have allocation information from qcow2, but not
> > from raw format.
> >
> > My attempt to fix hole reporting in raw images has failed; reporting holes in
> > raw images is nice to have, but it broke the behavior of qemu-nbd with qcow2
> > images, which is a critical issue for ovirt.
>
> Rather, ovirt had been relying on buggy behavior, and now that the bug
> has been fixed, we are scrambling to figure out how to make ovirt
> still play nicely.  But my answer to that is to use
> qemu:allocation-depth.  It was introduced in 5.2, so it predates the
> point where base:allocation behavior was fixed, and it provides the
> answer to the question you are really asking (which parts of my image
> came from the image directly, rather than a backing file), rather than
> merely an indirect answer (how can I abuse the determination of which
> parts of the image are allocated or sparse to imply that those same
> portions must come from a backing image).  There is nothing
> semantically wrong with a sparse cluster in the top layer overriding a
> non-sparse cluster from the backing file.
>
> >
> > The code using this was tested and released 3-4 months ago. This was added
> > to support backup vendors using snapshot-based backup, so they can move
> > to the NBD-based pipeline, which is safer than the old way of uploading
> > qcow2 images directly to storage.
>
> Downstream can play games with backporting or reverting patches as
> needed to make the entire ecosystem play nicely, but upstream, I don't
> think it is right to take a patch that re-introduces a bug in
> base:allocation merely because a client was unwilling to use the
> correct interface of qemu:allocation-depth to get the answer to the
> question they really meant to ask.
>
> >
> > If I revert:
> >
> > commit 0da9856851dcca09222a1467e16ddd05dc66e460
> > Author: Nir Soffer <nirsof@gmail.com>
> > Date:   Fri Feb 19 18:07:52 2021 +0200
> >
> >     nbd: server: Report holes for raw images
> >
> > qemu-nbd reports zeroed areas in a useful way like it always did:
> >
> > $ ./qemu-nbd -r -t 'json:{"driver": "qcow2", "backing": null, "file":
> > {"driver": "file", "filename": "top.qcow2"}}' &
> >
> > $ nbdinfo --map nbd://localhost
> >          0       65536    3  hole,zero
> >      65536       65536    0  allocated
> >     131072       65536    2  zero
> >     196608       65536    3  hole,zero
> >
> > There is no need to use allocation depth info, the base:allocation works fine
> > for this use case, and the output makes sense.
>
> But you have to remember that the NBD specification for
> base:allocation is that flags are OPTIONAL at all times.  You are at
> the mercy of the server for which flags it sets; it is ALWAYS valid
> for a server to reply with status 0 for any cluster; any server is
> free to return 0 in the future where it returned 1 today (perhaps
> because the server changed its mind on how expensive it was to compute
> when it is safe to set the bit to 1), or to return 1 in the future
> where it returned 0 today (perhaps because the server gained insight
> into how to quickly report something that happens to be true about
> that cluster).  Basing decisions on what was returned as a 1 bit in
> NBD_CMD_BLOCK_STATUS is okay if those decisions are solely for an
> optimization purpose (because we KNOW this portion of this disk will
> read as zero, we don't have to read it), but wrong if they are for an
> accuracy purpose (assuming that only clusters where hole is clear is a
> 1:1 correspondence to which clusters override a backing image).  And
> this is a case where qcow2 happens to quickly know that a sparse zero
> cluster exists, so reporting hole,zero IS more accurate than mere
> zero.  The fact that we failed to omit hole in the past was not a
> violation of the NBD spec, but rather a bug in our code for looking at
> the wrong bit of information.  But the mere presence of the hole bit
> is NOT sufficient knowledge to know that the NBD server is serving a
> qcow2 image that has a non-allocated zero cluster overriding the
> backing image, because we can't even guarantee the server is visiting
> a qcow2 file.

In other words, what we do now happens to work with qemu-nbd, but
may not work with an arbitrary NBD server. On the other hand, we support
only qemu-nbd because it is the only NBD server that understands qcow2.

I agree that using a qemu extension is more robust than using NBD
semantics which are not strict enough for this purpose. We probably
should have discussed the details before we started to use this for
detecting unallocated areas in qcow2 images.

...
> > > > cluster is merged with the fourth cluster, which is actually a hole.
> > > >
> > > > This is incorrect since, if it were a hole, the third cluster of the base
> > > > image would be exposed to the guest. Programs using qemu-nbd output to
> > > > reconstruct the image chain on other storage would be confused and copy
> > > > only the first 2 clusters. The result of this copy would be an image
> > > > exposing the third cluster from the base image, corrupting the guest data.
> > >
> > > This is where I disagree - if the NBD protocol exposed the notion of a
> > > backing file, then reporting a local hole should indeed imply reading
> > > from the backing file.  But since base NBD protocol does NOT support
> > > backing images of any sort, you cannot make any assumptions about what
> > > base:allocation says about holes, other than that the image presented
> > > over NBD was not fully allocated in some manner at that location.  You
> > > instead need to fix your tooling to query qemu:allocation-depth if you
> > > are trying to reconstruct all state known by qemu.
> >
> > When we look at the entire chain, you are correct, but when we disable
> > the backing file and have no backing chain, NBD protocol hole bit is
> > the right way to describe an unallocated area in a qcow2 image.
>
> Your problem is that qcow2 images have two ways to represent a zero
> cluster.  One is a pre-allocated cluster (there is host storage
> reserved; the host storage might or might not read as zero, but the
> qcow2 code will return zero, regardless of whether there is a backing
> file); the other is an unallocated cluster (there is no host storage
> reserved, but the qcow2 code still reads zero regardless of whether
> there is a backing file).  One of those is allocated, the other is
> sparse.  And NBD_STATE_HOLE is a great fit for telling those two types
> of clusters apart.
>
> In short, qemu has 3 bits of interesting information per cluster:
> whether it reads as zero, whether it has allocated space, and whether
> it overrides a backing file.  The NBD protocol base:allocation only
> provides two of those bits of information (BDRV_BLOCK_DATA and
> BDRV_BLOCK_ZERO); to get at the third (BDRV_BLOCK_ALLOCATED), you HAVE
> to use a qemu-specific extension.
>
> That said, we CAN provide a _new_ qemu:block-status metacontext, which
> returns a bitmap of bit 0 (NBD_STATE_HOLE), bit 1 (NBD_STATE_ZERO),
> and bit 2 (BDRV_BLOCK_ALLOCATED) all from a single context, at which
> point you don't have to reconstruct the combination of base:allocation
> (two bits) and qemu:allocation-depth (an arbitrary integer) across
> possibly differing reporting lengths yourself.
>
> I'll post a counter-proposal patch along those lines, shortly.

Sounds great!

If I understand this correctly, for an unallocated area in a raw image,
and for a zeroed cluster in a qcow2 image, qemu-nbd finds:

    BDRV_BLOCK_ZERO | BDRV_BLOCK_ALLOCATED

So it will report:

    NBD_STATE_ZERO | NBD_STATE_HOLE | QEMU_STATE_ALLOCATED

And for an unallocated area in a qcow2 image, qemu-nbd will find:

    BDRV_BLOCK_ZERO

So it will report:

    NBD_STATE_ZERO | NBD_STATE_HOLE

So on the client side, only this can be treated as a hole.

This will work fine for handling unallocated areas in qcow2 images,
but not for holes in raw images. This is not critical for our use case,
but it would be nice if unallocated areas in raw images did not report
QEMU_STATE_ALLOCATED.
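This reading of the proposal can be written out as a small truth table. The flag values below are illustrative only (qemu's real BDRV_BLOCK_* constants live in include/block/block.h, and QEMU_STATE_ALLOCATED is a hypothetical bit of the proposed context):

```python
# Illustrative flag values; qemu's actual constants differ.
BDRV_BLOCK_DATA = 1 << 0
BDRV_BLOCK_ZERO = 1 << 1
BDRV_BLOCK_ALLOCATED = 1 << 2

NBD_STATE_HOLE = 1 << 0
NBD_STATE_ZERO = 1 << 1
QEMU_STATE_ALLOCATED = 1 << 2   # hypothetical bit in the proposed context

def report(bdrv):
    """Map block-status flags to bits of the proposed qemu:block-status
    context, following the semantics discussed in this thread."""
    flags = 0
    if not bdrv & BDRV_BLOCK_DATA:
        flags |= NBD_STATE_HOLE
    if bdrv & BDRV_BLOCK_ZERO:
        flags |= NBD_STATE_ZERO
    if bdrv & BDRV_BLOCK_ALLOCATED:
        flags |= QEMU_STATE_ALLOCATED
    return flags

# Unallocated raw area / zeroed qcow2 cluster:
#   report(BDRV_BLOCK_ZERO | BDRV_BLOCK_ALLOCATED)
#   -> NBD_STATE_HOLE | NBD_STATE_ZERO | QEMU_STATE_ALLOCATED
# Unallocated qcow2 cluster (the only true hole for this use case):
#   report(BDRV_BLOCK_ZERO) -> NBD_STATE_HOLE | NBD_STATE_ZERO
```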

...
> > Since the original change introduced a regression breaking existing users, we
> > don't have good enough tests in this area, and it is not clear how this can be
> > fixed properly, I suggest reverting the original change for 6.0. That gives us
> > time to work on a proper fix upstream.
>
> 6.0 is already released upstream, so you are stuck with that new
> behavior.  And I think it is worse to ping-pong the behavior back to
> the way it was in 5.2 for 6.1, than to just keep the 6.0 behavior
> (which fixed a bug) unchanged going forward into 6.1 upstream.
>
> Downstream, we can backport and/or revert whatever patches we want to
> make the downstream ecosystem play nicely.  But that is an argument
> for downstream.

Reverting downstream can be good enough to avoid breaking existing
users when qemu-6.0.0 appears in RHEL-like distros.

> > What do you think?
>
> I'll go ahead and work up my counter-proposal patch to add yet another
> qemu:... context to make it easier for you to get combined status in
> one go.  I don't know whether it will be attractive enough to be worth
> incorporating upstream, but if it is, that is certainly another option
> that downstream can use, because it is easier to backport a new qemu:
> context than it is to play games with "which behavior will
> base:allocation have" - you are guaranteed that if qemu knows how to
> advertise the new context, it provides the information you want.

I think this is upstream-worthy and a good direction for solving this issue.

Thanks,
Nir
Eric Blake June 10, 2021, 6:34 p.m. UTC | #6
On Tue, Jun 08, 2021 at 07:38:10PM +0300, Nir Soffer wrote:
> The example I provided was not detailed enough, what we actually do is:
> 
>     qemu-nbd .. 'json:{"driver": "qcow2", "backing": null, "file":
> {"driver": "file", "filename": "top.qcow2"}}'
> 
> So there is no backing chain and allocation depth is not relevant.
> - Allocated areas should be reported with flags 0
> - Zero areas which are not holes should be reported as NBD_STATE_ZERO
> - Zero areas which are holes (not allocated in this image) should be
> reported as NBD_STATE_HOLE
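The reporting described in the quoted list can be sketched as a small helper. This is illustrative, not qemu or libnbd code; the function name is hypothetical, and only the NBD_STATE_* bit values come from the NBD protocol specification. Note the sketch follows the list literally in returning only NBD_STATE_HOLE for holes; real servers typically also set the zero bit on holes that read as zero.

```python
# NBD block status flag bits, as defined by the NBD protocol spec.
NBD_STATE_HOLE = 1 << 0
NBD_STATE_ZERO = 1 << 1

def base_allocation_flags(allocated, reads_zero):
    """Flags for one extent per the reporting described above:
    allocated data -> 0, explicit zeroes -> NBD_STATE_ZERO,
    holes (not allocated in this image) -> NBD_STATE_HOLE."""
    if allocated and not reads_zero:
        return 0
    if allocated and reads_zero:
        return NBD_STATE_ZERO
    return NBD_STATE_HOLE
```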

Thinking about this a bit more, here's something I noticed:

$ qemu-img map --output=json -f raw base.raw
[{ "start": 0, "length": 196608, "depth": 0, "zero": false, "data": true, "offset": 0},
{ "start": 196608, "length": 65536, "depth": 0, "zero": true, "data": false, "offset": 196608}]

which matches what I've said elsewhere in this thread: the entire
image is reported as "depth":0 because the raw file is responsible for
100% of the content.

But:

$ qemu-img map --output=json -f qcow2 json:'{"driver":"qcow2","backing":null, \
  "file":{"driver":"file","filename":"top.qcow2"}}'
[{ "start": 0, "length": 65536, "depth": 0, "zero": true, "data": false},
{ "start": 65536, "length": 65536, "depth": 0, "zero": false, "data": true, "offset": 327680},
{ "start": 131072, "length": 131072, "depth": 0, "zero": true, "data": false}]

also reports the entire file at "depth":0, which is misleading, since
we have just been arguing from the qemu:allocation-depth perspective
(and also from bdrv_block_status) that the qcow2 image is NOT 100%
allocated (in the sense where allocation == data comes locally).
Perhaps it might be better if we tweaked the above qemu-img map to
produce:

[{ "start": 0, "length": 65536, "depth": -1, "zero": true, "data": false},
{ "start": 65536, "length": 65536, "depth": 0, "zero": false, "data": true, "offset": 327680},
{ "start": 131072, "length": 65536, "depth": 0, "zero": true, "data": false},
{ "start": 196608, "length": 65536, "depth": -1, "zero": true, "data": false}]

that is, use "depth":-1 to explicitly denote portions of a qcow2 file
which are NOT provided locally, and which are not found anywhere in
the backing chain.  In other words, make it explicit in qemu-img map
output what is possible with qemu:allocation-depth==0.

Or tweak it slightly to mean that "depth":-1 corresponds to "cluster
is not provided by the current layer, but we could not determine if it
is provided by a particular backing layer or if it was unallocated
overall".  Then positive depth means we know which point in the
backing chain we deferred to, 0 is local, and negative depth means
that we defer to a backing layer (but could not report WHICH layer, if
any).  This tweak would make it easier for my thoughts of having qemu
NBD clients automatically request qemu:allocation-depth without having
to resort to x-dirty-bitmap hacks, and still be able to expose the
information via qemu-img map.
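A client-side interpretation of the tweaked field might look like the following sketch. The helper is hypothetical; it only restates the three cases of the proposal above, not actual qemu behavior.

```python
def describe_depth(depth):
    """Interpret "depth" under the proposed tweak: 0 means the mapped
    image itself provides the content, a positive value names the
    backing layer that provides it, and a negative value means the
    region is deferred to a backing layer (which one, if any, could
    not be determined)."""
    if depth == 0:
        return "provided locally"
    if depth > 0:
        return "provided by backing layer %d" % depth
    return "deferred to an undetermined backing layer (possibly absent)"
```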

I'm off to another round of code hacking to see how it looks...
Nir Soffer June 10, 2021, 8:09 p.m. UTC | #7
On Thu, Jun 10, 2021 at 9:35 PM Eric Blake <eblake@redhat.com> wrote:
>
> On Tue, Jun 08, 2021 at 07:38:10PM +0300, Nir Soffer wrote:
> > The example I provided was not detailed enough, what we actually do is:
> >
> >     qemu-nbd .. 'json:{"driver": "qcow2", "backing": null, "file":
> > {"driver": "file", "filename": "top.qcow2"}}'
> >
> > So there is no backing chain and allocation depth is not relevant.
> > - Allocated areas should be reported with flags 0
> > - Zero areas which are not holes should be reported as NBD_STATE_ZERO
> > - Zero areas which are holes (not allocated in this image) should be
> > reported as NBD_STATE_HOLE
>
> Thinking about this a bit more, here's something I noticed:
>
> $ qemu-img map --output=json -f raw base.raw
> [{ "start": 0, "length": 196608, "depth": 0, "zero": false, "data": true, "offset": 0},
> { "start": 196608, "length": 65536, "depth": 0, "zero": true, "data": false, "offset": 196608}]
>
> which matches what I've said elsewhere in this thread: the entire
> image is reported as "depth":0 because the raw file is responsible for
> 100% of the content.
>
> But:
>
> $ qemu-img map --output=json -f qcow2 json:'{"driver":"qcow2","backing":null, \
>   "file":{"driver":"file","filename":"top.qcow2"}}'
> [{ "start": 0, "length": 65536, "depth": 0, "zero": true, "data": false},
> { "start": 65536, "length": 65536, "depth": 0, "zero": false, "data": true, "offset": 327680},
> { "start": 131072, "length": 131072, "depth": 0, "zero": true, "data": false}]
>
> also reports the entire file at "depth":0, which is misleading, since
> we have just been arguing from the qemu:allocation-depth perspective
> (and also from bdrv_block_status) that the qcow2 image is NOT 100%
> allocated (in the sense where allocation == data comes locally).
> Perhaps it might be better if we tweaked the above qemu-img map to
> produce:
>
> [{ "start": 0, "length": 65536, "depth": -1, "zero": true, "data": false},
> { "start": 65536, "length": 65536, "depth": 0, "zero": false, "data": true, "offset": 327680},
> { "start": 131072, "length": 65536, "depth": 0, "zero": true, "data": false},
> { "start": 196608, "length": 65536, "depth": -1, "zero": true, "data": false}]

It will be more consistent with "offset" to drop "depth" from output
if we don't have it:

    [{ "start": 0, "length": 65536, "zero": true, "data": false},
     { "start": 65536, "length": 65536, "depth": 0, "zero": false, "data": true, "offset": 327680},
     { "start": 131072, "length": 65536, "depth": 0, "zero": true, "data": false},
     { "start": 196608, "length": 65536, "zero": true, "data": false}]
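A consumer of this proposed output would treat "depth" as optional, exactly as it already must treat "offset". A sketch (the sample JSON is a hypothetical excerpt in the format suggested above):

```python
import json

# Sample map output with "depth" omitted for regions that nothing in
# the (possibly missing) backing chain provides.
output = '''[
 { "start": 0, "length": 65536, "zero": true, "data": false},
 { "start": 65536, "length": 65536, "depth": 0, "zero": false,
   "data": true, "offset": 327680}
]'''

# dict.get() yields None where the key was omitted.
depths = [seg.get("depth") for seg in json.loads(output)]
```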
Eric Blake June 10, 2021, 8:46 p.m. UTC | #8
On Thu, Jun 10, 2021 at 11:09:05PM +0300, Nir Soffer wrote:
> > But:
> >
> > $ qemu-img map --output=json -f qcow2 json:'{"driver":"qcow2","backing":null, \
> >   "file":{"driver":"file","filename":"top.qcow2"}}'
> > [{ "start": 0, "length": 65536, "depth": 0, "zero": true, "data": false},
> > { "start": 65536, "length": 65536, "depth": 0, "zero": false, "data": true, "offset": 327680},
> > { "start": 131072, "length": 131072, "depth": 0, "zero": true, "data": false}]
> >
> > also reports the entire file at "depth":0, which is misleading, since
> > we have just been arguing from the qemu:allocation-depth perspective
> > (and also from bdrv_block_status) that the qcow2 image is NOT 100%
> > allocated (in the sense where allocation == data comes locally).
> > Perhaps it might be better if we tweaked the above qemu-img map to
> > produce:
> >
> > [{ "start": 0, "length": 65536, "depth": -1, "zero": true, "data": false},
> > { "start": 65536, "length": 65536, "depth": 0, "zero": false, "data": true, "offset": 327680},
> > { "start": 131072, "length": 65536, "depth": 0, "zero": true, "data": false},
> > { "start": 196608, "length": 65536, "depth": -1, "zero": true, "data": false}]
> 
> It will be more consistent with "offset" to drop "depth" from output
> if we don't have it:
> 
>     [{ "start": 0, "length": 65536, "zero": true, "data": false},
>      { "start": 65536, "length": 65536, "depth": 0, "zero": false,
> "data": true, "offset": 327680},
>      { "start": 131072, "length": 65536, "depth": 0, "zero": true,
> "data": false},
>      { "start": 196608, "length": 65536, "zero": true, "data": false}]

Yes, that might work as well.  But we didn't previously document
depth to be optional.  Removing something from output risks breaking
more downstream tools that expect it to be non-optional, compared to
providing a new value.
Kevin Wolf June 11, 2021, 8:09 a.m. UTC | #9
Am 10.06.2021 um 22:46 hat Eric Blake geschrieben:
> On Thu, Jun 10, 2021 at 11:09:05PM +0300, Nir Soffer wrote:
> > > But:
> > >
> > > $ qemu-img map --output=json -f qcow2 json:'{"driver":"qcow2","backing":null, \
> > >   "file":{"driver":"file","filename":"top.qcow2"}}'
> > > [{ "start": 0, "length": 65536, "depth": 0, "zero": true, "data": false},
> > > { "start": 65536, "length": 65536, "depth": 0, "zero": false, "data": true, "offset": 327680},
> > > { "start": 131072, "length": 131072, "depth": 0, "zero": true, "data": false}]
> > >
> > > also reports the entire file at "depth":0, which is misleading, since
> > > we have just been arguing from the qemu:allocation-depth perspective
> > > (and also from bdrv_block_status) that the qcow2 image is NOT 100%
> > > allocated (in the sense where allocation == data comes locally).
> > > Perhaps it might be better if we tweaked the above qemu-img map to
> > > produce:
> > >
> > > [{ "start": 0, "length": 65536, "depth": -1, "zero": true, "data": false},
> > > { "start": 65536, "length": 65536, "depth": 0, "zero": false, "data": true, "offset": 327680},
> > > { "start": 131072, "length": 65536, "depth": 0, "zero": true, "data": false},
> > > { "start": 196608, "length": 65536, "depth": -1, "zero": true, "data": false}]
> > 
> > It will be more consistent with "offset" to drop "depth" from output
> > if we don't have it:
> > 
> >     [{ "start": 0, "length": 65536, "zero": true, "data": false},
> >      { "start": 65536, "length": 65536, "depth": 0, "zero": false,
> > "data": true, "offset": 327680},
> >      { "start": 131072, "length": 65536, "depth": 0, "zero": true,
> > "data": false},
> >      { "start": 196608, "length": 65536, "zero": true, "data": false}]
> 
> Yes, that might work as well.  But we didn't previously document
> depth to be optional.  Removing something from output risks breaking
> more downstream tools that expect it to be non-optional, compared to
> providing a new value.

A negative value isn't any less unexpected than a missing key. I don't
think any existing tool would be able to handle it. Encoding different
meanings in a single value isn't very QAPI-like either. Usually strings
that are parsed are the problem, but negative integers really aren't
that much different. I don't really like this solution.

Leaving out the depth feels like a better suggestion to me.

But anyway, this seems to only happen at the end of the backing chain.
So if the backing chain consists of n images, why not report 'depth':
n + 1? So, in the above example, you would get 1. I think this has the
best chances of tools actually working correctly with the new output,
even though it's still not unlikely to break something.

Kevin
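Reading the suggestion as counting the backing images below the top layer (which makes the example consistent: with "backing": null there are zero backing images, so the absent regions get depth 1), the rule is a one-liner. This interpretation is the editor's, hedged accordingly:

```python
def absent_depth(num_backing_images):
    """Depth reported for a cluster no layer provides, under the
    proposal above: one past the deepest existing layer. The top
    image is depth 0; its backing images are 1..n."""
    return num_backing_images + 1
```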
Vladimir Sementsov-Ogievskiy June 11, 2021, 8:14 a.m. UTC | #10
11.06.2021 11:09, Kevin Wolf wrote:
> Am 10.06.2021 um 22:46 hat Eric Blake geschrieben:
>> On Thu, Jun 10, 2021 at 11:09:05PM +0300, Nir Soffer wrote:
>>>> But:
>>>>
>>>> $ qemu-img map --output=json -f qcow2 json:'{"driver":"qcow2","backing":null, \
>>>>    "file":{"driver":"file","filename":"top.qcow2"}}'
>>>> [{ "start": 0, "length": 65536, "depth": 0, "zero": true, "data": false},
>>>> { "start": 65536, "length": 65536, "depth": 0, "zero": false, "data": true, "offset": 327680},
>>>> { "start": 131072, "length": 131072, "depth": 0, "zero": true, "data": false}]
>>>>
>>>> also reports the entire file at "depth":0, which is misleading, since
>>>> we have just been arguing from the qemu:allocation-depth perspective
>>>> (and also from bdrv_block_status) that the qcow2 image is NOT 100%
>>>> allocated (in the sense where allocation == data comes locally).
>>>> Perhaps it might be better if we tweaked the above qemu-img map to
>>>> produce:
>>>>
>>>> [{ "start": 0, "length": 65536, "depth": -1, "zero": true, "data": false},
>>>> { "start": 65536, "length": 65536, "depth": 0, "zero": false, "data": true, "offset": 327680},
>>>> { "start": 131072, "length": 65536, "depth": 0, "zero": true, "data": false},
>>>> { "start": 196608, "length": 65536, "depth": -1, "zero": true, "data": false}]
>>>
>>> It will be more consistent with "offset" to drop "depth" from output
>>> if we don't have it:
>>>
>>>      [{ "start": 0, "length": 65536, "zero": true, "data": false},
>>>       { "start": 65536, "length": 65536, "depth": 0, "zero": false,
>>> "data": true, "offset": 327680},
>>>       { "start": 131072, "length": 65536, "depth": 0, "zero": true,
>>> "data": false},
>>>       { "start": 196608, "length": 65536, "zero": true, "data": false}]
>>
>> Yes, that might work as well.  But we didn't previously document
>> depth to be optional.  Removing something from output risks breaking
>> more downstream tools that expect it to be non-optional, compared to
>> providing a new value.
> 
> A negative value isn't any less unexpected than a missing key. I don't
> think any existing tool would be able to handle it. Encoding different
> meanings in a single value isn't very QAPI-like either. Usually strings
> that are parsed are the problem, but negative integers really isn't that
> much different. I don't really like this solution.
> 
> Leaving out the depth feels like a better suggestion to me.
> 
> But anyway, this seems to only happen at the end of the backing chain.
> So if the backing chain consists of n images, why not report 'depth':
> n + 1? So, in the above example, you would get 1. I think this has the
> best chances of tools actually working correctly with the new output,
> even though it's still not unlikely to break something.
> 

Did you consider just adding a new field?

So, "depth" keeps its meaning "which level provides data".

And we add additional optional field like

absolutely-completely-absent: bool

Which is true if data is nowhere in the backing chain.
Nir Soffer June 11, 2021, 9:05 a.m. UTC | #11
> On 11 Jun 2021, at 11:14, Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com> wrote:
> 
> 11.06.2021 11:09, Kevin Wolf wrote:
>> Am 10.06.2021 um 22:46 hat Eric Blake geschrieben:
>>>> On Thu, Jun 10, 2021 at 11:09:05PM +0300, Nir Soffer wrote:
>>>>>> But:
>>>>>> 
>>>>>> $ qemu-img map --output=json -f qcow2 json:'{"driver":"qcow2","backing":null, \
>>>>>>   "file":{"driver":"file","filename":"top.qcow2"}}'
>>>>>> [{ "start": 0, "length": 65536, "depth": 0, "zero": true, "data": false},
>>>>>> { "start": 65536, "length": 65536, "depth": 0, "zero": false, "data": true, "offset": 327680},
>>>>>> { "start": 131072, "length": 131072, "depth": 0, "zero": true, "data": false}]
>>>>>> 
>>>>>> also reports the entire file at "depth":0, which is misleading, since
>>>>>> we have just been arguing from the qemu:allocation-depth perspective
>>>>>> (and also from bdrv_block_status) that the qcow2 image is NOT 100%
>>>>>> allocated (in the sense where allocation == data comes locally).
>>>>>> Perhaps it might be better if we tweaked the above qemu-img map to
>>>>>> produce:
>>>>>> 
>>>>>> [{ "start": 0, "length": 65536, "depth": -1, "zero": true, "data": false},
>>>>>> { "start": 65536, "length": 65536, "depth": 0, "zero": false, "data": true, "offset": 327680},
>>>>>> { "start": 131072, "length": 65536, "depth": 0, "zero": true, "data": false},
>>>>>> { "start": 196608, "length": 65536, "depth": -1, "zero": true, "data": false}]
>>>>> 
>>>>> It will be more consistent with "offset" to drop "depth" from output
>>>>> if we don't have it:
>>>>> 
>>>>>     [{ "start": 0, "length": 65536, "zero": true, "data": false},
>>>>>      { "start": 65536, "length": 65536, "depth": 0, "zero": false,
>>>>> "data": true, "offset": 327680},
>>>>>      { "start": 131072, "length": 65536, "depth": 0, "zero": true,
>>>>> "data": false},
>>>>>      { "start": 196608, "length": 65536, "zero": true, "data": false}]
>>> 
>>> Yes, that might work as well.  But we didn't previously document
>>> depth to be optional.  Removing something from output risks breaking
>>> more downstream tools that expect it to be non-optional, compared to
>>> providing a new value.
>> A negative value isn't any less unexpected than a missing key. I don't
>> think any existing tool would be able to handle it. Encoding different
>> meanings in a single value isn't very QAPI-like either. Usually strings
>> that are parsed are the problem, but negative integers really isn't that
>> much different. I don't really like this solution.
>> Leaving out the depth feels like a better suggestion to me.
>> But anyway, this seems to only happen at the end of the backing chain.
>> So if the backing chain consists of n images, why not report 'depth':
>> n + 1? So, in the above example, you would get 1. I think this has the
>> best chances of tools actually working correctly with the new output,
>> even though it's still not unlikely to break something.
> 
> Did you consider just adding a new field?
> 
> So, "depth" keeps its meaning "which level provides data".
> 
> And we add additional optional field like
> 
> absolutely-completely-absent: bool

hole: bool?

> 
> Which is true if data is nowhere in the backing chain.
> 
> 
> -- 
> Best regards,
> Vladimir
Vladimir Sementsov-Ogievskiy June 11, 2021, 11:14 a.m. UTC | #12
11.06.2021 12:05, Nir Soffer wrote:
> 
> 
>> On 11 Jun 2021, at 11:14, Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com> wrote:
>>
>> 11.06.2021 11:09, Kevin Wolf wrote:
>>> Am 10.06.2021 um 22:46 hat Eric Blake geschrieben:
>>>> On Thu, Jun 10, 2021 at 11:09:05PM +0300, Nir Soffer wrote:
>>>>>> But:
>>>>>>
>>>>>> $ qemu-img map --output=json -f qcow2 json:'{"driver":"qcow2","backing":null, \
>>>>>>   "file":{"driver":"file","filename":"top.qcow2"}}'
>>>>>> [{ "start": 0, "length": 65536, "depth": 0, "zero": true, "data": false},
>>>>>> { "start": 65536, "length": 65536, "depth": 0, "zero": false, "data": true, "offset": 327680},
>>>>>> { "start": 131072, "length": 131072, "depth": 0, "zero": true, "data": false}]
>>>>>>
>>>>>> also reports the entire file at "depth":0, which is misleading, since
>>>>>> we have just been arguing from the qemu:allocation-depth perspective
>>>>>> (and also from bdrv_block_status) that the qcow2 image is NOT 100%
>>>>>> allocated (in the sense where allocation == data comes locally).
>>>>>> Perhaps it might be better if we tweaked the above qemu-img map to
>>>>>> produce:
>>>>>>
>>>>>> [{ "start": 0, "length": 65536, "depth": -1, "zero": true, "data": false},
>>>>>> { "start": 65536, "length": 65536, "depth": 0, "zero": false, "data": true, "offset": 327680},
>>>>>> { "start": 131072, "length": 65536, "depth": 0, "zero": true, "data": false},
>>>>>> { "start": 196608, "length": 65536, "depth": -1, "zero": true, "data": false}]
>>>>>
>>>>> It will be more consistent with "offset" to drop "depth" from output
>>>>> if we don't have it:
>>>>>
>>>>>     [{ "start": 0, "length": 65536, "zero": true, "data": false},
>>>>>      { "start": 65536, "length": 65536, "depth": 0, "zero": false,
>>>>> "data": true, "offset": 327680},
>>>>>      { "start": 131072, "length": 65536, "depth": 0, "zero": true,
>>>>> "data": false},
>>>>>      { "start": 196608, "length": 65536, "zero": true, "data": false}]
>>>>
>>>> Yes, that might work as well.  But we didn't previously document
>>>> depth to be optional.  Removing something from output risks breaking
>>>> more downstream tools that expect it to be non-optional, compared to
>>>> providing a new value.
>>> A negative value isn't any less unexpected than a missing key. I don't
>>> think any existing tool would be able to handle it. Encoding different
>>> meanings in a single value isn't very QAPI-like either. Usually strings
>>> that are parsed are the problem, but negative integers really isn't that
>>> much different. I don't really like this solution.
>>> Leaving out the depth feels like a better suggestion to me.
>>> But anyway, this seems to only happen at the end of the backing chain.
>>> So if the backing chain consists of n images, why not report 'depth':
>>> n + 1? So, in the above example, you would get 1. I think this has the
>>> best chances of tools actually working correctly with the new output,
>>> even though it's still not unlikely to break something.
>>
>> Did you consider just adding a new field?
>>
>> So, "depth" keeps its meaning "which level provides data".
>>
>> And we add additional optional field like
>>
>> absolutely-completely-absent: bool
> 
> hole: bool?
> 

That clashes with file-posix holes, which are UNALLOCATED_ZERO.

I think we should start honestly reporting backing chains and using the "backing" concept in interfaces.

maybe

nobacking: bool

   May be true only together with data=false and zero=true, and means that all layers refer to backing for this region, but the last layer just doesn't have a backing image currently and therefore returns zeroes.
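The validity constraint stated here can be checked mechanically. A sketch with a hypothetical validator (field names follow the proposal, nothing here is implemented in qemu):

```python
def nobacking_valid(segment):
    """Vladimir's rule: "nobacking": true is only meaningful
    together with "data": false and "zero": true."""
    if not segment.get("nobacking", False):
        return True  # field absent or false: nothing to check
    return segment.get("data") is False and segment.get("zero") is True
```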
Kevin Wolf June 11, 2021, 11:21 a.m. UTC | #13
Am 11.06.2021 um 10:14 hat Vladimir Sementsov-Ogievskiy geschrieben:
> 11.06.2021 11:09, Kevin Wolf wrote:
> > Am 10.06.2021 um 22:46 hat Eric Blake geschrieben:
> > > On Thu, Jun 10, 2021 at 11:09:05PM +0300, Nir Soffer wrote:
> > > > > But:
> > > > > 
> > > > > $ qemu-img map --output=json -f qcow2 json:'{"driver":"qcow2","backing":null, \
> > > > >    "file":{"driver":"file","filename":"top.qcow2"}}'
> > > > > [{ "start": 0, "length": 65536, "depth": 0, "zero": true, "data": false},
> > > > > { "start": 65536, "length": 65536, "depth": 0, "zero": false, "data": true, "offset": 327680},
> > > > > { "start": 131072, "length": 131072, "depth": 0, "zero": true, "data": false}]
> > > > > 
> > > > > also reports the entire file at "depth":0, which is misleading, since
> > > > > we have just been arguing from the qemu:allocation-depth perspective
> > > > > (and also from bdrv_block_status) that the qcow2 image is NOT 100%
> > > > > allocated (in the sense where allocation == data comes locally).
> > > > > Perhaps it might be better if we tweaked the above qemu-img map to
> > > > > produce:
> > > > > 
> > > > > [{ "start": 0, "length": 65536, "depth": -1, "zero": true, "data": false},
> > > > > { "start": 65536, "length": 65536, "depth": 0, "zero": false, "data": true, "offset": 327680},
> > > > > { "start": 131072, "length": 65536, "depth": 0, "zero": true, "data": false},
> > > > > { "start": 196608, "length": 65536, "depth": -1, "zero": true, "data": false}]
> > > > 
> > > > It will be more consistent with "offset" to drop "depth" from output
> > > > if we don't have it:
> > > > 
> > > >      [{ "start": 0, "length": 65536, "zero": true, "data": false},
> > > >       { "start": 65536, "length": 65536, "depth": 0, "zero": false,
> > > > "data": true, "offset": 327680},
> > > >       { "start": 131072, "length": 65536, "depth": 0, "zero": true,
> > > > "data": false},
> > > >       { "start": 196608, "length": 65536, "zero": true, "data": false}]
> > > 
> > > Yes, that might work as well.  But we didn't previously document
> > > depth to be optional.  Removing something from output risks breaking
> > > more downstream tools that expect it to be non-optional, compared to
> > > providing a new value.
> > 
> > A negative value isn't any less unexpected than a missing key. I don't
> > think any existing tool would be able to handle it. Encoding different
> > meanings in a single value isn't very QAPI-like either. Usually strings
> > that are parsed are the problem, but negative integers really isn't that
> > much different. I don't really like this solution.
> > 
> > Leaving out the depth feels like a better suggestion to me.
> > 
> > But anyway, this seems to only happen at the end of the backing chain.
> > So if the backing chain consists of n images, why not report 'depth':
> > n + 1? So, in the above example, you would get 1. I think this has the
> > best chances of tools actually working correctly with the new output,
> > even though it's still not unlikely to break something.
> > 
> 
> Did you consider just adding a new field?
> 
> So, "depth" keeps its meaning "which level provides data".
> 
> And we add additional optional field like
> 
> absolutely-completely-absent: bool
> 
> Which is true if data is nowhere in the backing chain.

Or how about exposing BDRV_BLOCK_ALLOCATED as 'allocated': 'bool'? Which
I think is what the conclusion was already for NBD, so doing the same in
'qemu-img map' would be consistent.

This is, of course, almost the same as 'absolutely-completely-absent',
just without the negating the flag.

Kevin
Vladimir Sementsov-Ogievskiy June 11, 2021, 1:04 p.m. UTC | #14
11.06.2021 14:21, Kevin Wolf wrote:
> Am 11.06.2021 um 10:14 hat Vladimir Sementsov-Ogievskiy geschrieben:
>> 11.06.2021 11:09, Kevin Wolf wrote:
>>> Am 10.06.2021 um 22:46 hat Eric Blake geschrieben:
>>>> On Thu, Jun 10, 2021 at 11:09:05PM +0300, Nir Soffer wrote:
>>>>>> But:
>>>>>>
>>>>>> $ qemu-img map --output=json -f qcow2 json:'{"driver":"qcow2","backing":null, \
>>>>>>     "file":{"driver":"file","filename":"top.qcow2"}}'
>>>>>> [{ "start": 0, "length": 65536, "depth": 0, "zero": true, "data": false},
>>>>>> { "start": 65536, "length": 65536, "depth": 0, "zero": false, "data": true, "offset": 327680},
>>>>>> { "start": 131072, "length": 131072, "depth": 0, "zero": true, "data": false}]
>>>>>>
>>>>>> also reports the entire file at "depth":0, which is misleading, since
>>>>>> we have just been arguing from the qemu:allocation-depth perspective
>>>>>> (and also from bdrv_block_status) that the qcow2 image is NOT 100%
>>>>>> allocated (in the sense where allocation == data comes locally).
>>>>>> Perhaps it might be better if we tweaked the above qemu-img map to
>>>>>> produce:
>>>>>>
>>>>>> [{ "start": 0, "length": 65536, "depth": -1, "zero": true, "data": false},
>>>>>> { "start": 65536, "length": 65536, "depth": 0, "zero": false, "data": true, "offset": 327680},
>>>>>> { "start": 131072, "length": 65536, "depth": 0, "zero": true, "data": false},
>>>>>> { "start": 196608, "length": 65536, "depth": -1, "zero": true, "data": false}]
>>>>>
>>>>> It will be more consistent with "offset" to drop "depth" from output
>>>>> if we don't have it:
>>>>>
>>>>>       [{ "start": 0, "length": 65536, "zero": true, "data": false},
>>>>>        { "start": 65536, "length": 65536, "depth": 0, "zero": false,
>>>>> "data": true, "offset": 327680},
>>>>>        { "start": 131072, "length": 65536, "depth": 0, "zero": true,
>>>>> "data": false},
>>>>>        { "start": 196608, "length": 65536, "zero": true, "data": false}]
>>>>
>>>> Yes, that might work as well.  But we didn't previously document
>>>> depth to be optional.  Removing something from output risks breaking
>>>> more downstream tools that expect it to be non-optional, compared to
>>>> providing a new value.
>>>
>>> A negative value isn't any less unexpected than a missing key. I don't
>>> think any existing tool would be able to handle it. Encoding different
>>> meanings in a single value isn't very QAPI-like either. Usually strings
>>> that are parsed are the problem, but negative integers really isn't that
>>> much different. I don't really like this solution.
>>>
>>> Leaving out the depth feels like a better suggestion to me.
>>>
>>> But anyway, this seems to only happen at the end of the backing chain.
>>> So if the backing chain consists of n images, why not report 'depth':
>>> n + 1? So, in the above example, you would get 1. I think this has the
>>> best chances of tools actually working correctly with the new output,
>>> even though it's still not unlikely to break something.
>>>
>>
>> Did you consider just adding a new field?
>>
>> So, "depth" keeps its meaning "which level provides data".
>>
>> And we add additional optional field like
>>
>> absolutely-completely-absent: bool
>>
>> Which is true if data is nowhere in the backing chain.
> 
> Or how about exposing BDRV_BLOCK_ALLOCATED as 'allocated': 'bool'? Which
> I think is what the conclusion was already for NBD, so doing the same in
> 'qemu-img map' would be consistent.

"allocated" is historically ambiguous: we never know exactly whether it means "occupies space on disk" or "data (or zeroes) is taken from this qcow2 image, not from the backing file".

Eric recently sent related patch to libnbd:

  [libnbd PATCH] info: Avoid ambiguous 'allocated' terminology in mapping

> 
> This is, of course, almost the same as 'absolutely-completely-absent',
> just without the negating the flag.
> 
> Kevin
>
Eric Blake June 11, 2021, 1:28 p.m. UTC | #15
On Fri, Jun 11, 2021 at 10:09:09AM +0200, Kevin Wolf wrote:
> > Yes, that might work as well.  But we didn't previously document
> > depth to be optional.  Removing something from output risks breaking
> > more downstream tools that expect it to be non-optional, compared to
> > providing a new value.
> 
> A negative value isn't any less unexpected than a missing key. I don't
> think any existing tool would be able to handle it. Encoding different
> meanings in a single value isn't very QAPI-like either. Usually strings
> that are parsed are the problem, but negative integers really isn't that
> much different. I don't really like this solution.
> 
> Leaving out the depth feels like a better suggestion to me.
> 
> But anyway, this seems to only happen at the end of the backing chain.
> So if the backing chain consists of n images, why not report 'depth':
> n + 1? So, in the above example, you would get 1. I think this has the
> best chances of tools actually working correctly with the new output,
> even though it's still not unlikely to break something.

Ooh, I like that.  It is closer to reality - the file data really
comes from the next depth, even if we have no filename at that depth.
v2 of my patch coming up.
Eric Blake June 11, 2021, 1:31 p.m. UTC | #16
On Fri, Jun 11, 2021 at 01:21:45PM +0200, Kevin Wolf wrote:
> > Did you consider just adding a new field?
> > 
> > So, "depth" keeps its meaning "which level provides data".
> > 
> > And we add additional optional field like
> > 
> > absolutely-completely-absent: bool
> > 
> > Which is true if data is nowhere in the backing chain.
> 
> Or how about exposing BDRV_BLOCK_ALLOCATED as 'allocated': 'bool'? Which
> I think is what the conclusion was already for NBD, so doing the same in
> 'qemu-img map' would be consistent.
> 
> This is, of course, almost the same as 'absolutely-completely-absent',
> just without the negating the flag.

If we want to bikeshed on a new name, I think "allocated" is going to
cause more confusion than it solves.  And "hole" is wrong.  Better
would be "backing":true for portions of the file that would derive
from a backing file, if a backing file had been present.

But that still feels like more work than just exposing n+1 in depth.
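A sketch of what the proposed "backing" flag would add to a map segment. The helper is hypothetical and only illustrates the field's placement and meaning as described above:

```python
def add_backing_flag(segment, provided_locally):
    """Annotate a map segment with the proposed "backing": true for
    portions that would derive from a backing file, if one were
    present. Segments provided locally are left untouched."""
    out = dict(segment)
    if not provided_locally:
        out["backing"] = True
    return out
```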
Nir Soffer June 11, 2021, 5:35 p.m. UTC | #17
On Fri, Jun 11, 2021 at 4:28 PM Eric Blake <eblake@redhat.com> wrote:
>
> On Fri, Jun 11, 2021 at 10:09:09AM +0200, Kevin Wolf wrote:
> > > Yes, that might work as well.  But we didn't previously document
> > > depth to be optional.  Removing something from output risks breaking
> > > more downstream tools that expect it to be non-optional, compared to
> > > providing a new value.
> >
> > A negative value isn't any less unexpected than a missing key. I don't
> > think any existing tool would be able to handle it. Encoding different
> > meanings in a single value isn't very QAPI-like either. Usually strings
> > that are parsed are the problem, but negative integers really isn't that
> > much different. I don't really like this solution.
> >
> > Leaving out the depth feels like a better suggestion to me.
> >
> > But anyway, this seems to only happen at the end of the backing chain.
> > So if the backing chain consists of n images, why not report 'depth':
> > n + 1? So, in the above example, you would get 1. I think this has the
> > best chances of tools actually working correctly with the new output,
> > even though it's still not unlikely to break something.
>
> Ooh, I like that.  It is closer to reality - the file data really
> comes from the next depth, even if we have no filename at that depth.
> v2 of my patch coming up.

How do you know the number of layers? This info is not presented in
the qemu-img map output.

Users will have to run "qemu-img info --backing-chain" to understand
the output of qemu-img map.
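This is the extra step such a consumer would need: resolving "depth" against a separately obtained backing chain. A sketch (the filenames are the ones from the reproducer; the helper itself is hypothetical):

```python
def layer_for_depth(chain, depth):
    """Map a "depth" value from qemu-img map to the image providing
    the data. chain lists the images top-first, as reported by
    'qemu-img info --backing-chain'. A depth one past the end of
    the chain means no layer provides the data."""
    return chain[depth] if depth < len(chain) else None
```

For example, with chain ["top.qcow2", "base.raw"], depth 1 resolves to "base.raw", while depth 2 means the region is absent everywhere.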
Eric Blake June 11, 2021, 6:34 p.m. UTC | #18
On Fri, Jun 11, 2021 at 08:35:01PM +0300, Nir Soffer wrote:
> On Fri, Jun 11, 2021 at 4:28 PM Eric Blake <eblake@redhat.com> wrote:
> >
> > On Fri, Jun 11, 2021 at 10:09:09AM +0200, Kevin Wolf wrote:
> > > > Yes, that might work as well.  But we didn't previously document
> > > > depth to be optional.  Removing something from output risks breaking
> > > > more downstream tools that expect it to be non-optional, compared to
> > > > providing a new value.
> > >
> > > A negative value isn't any less unexpected than a missing key. I don't
> > > think any existing tool would be able to handle it. Encoding different
> > > meanings in a single value isn't very QAPI-like either. Usually strings
> > > that are parsed are the problem, but negative integers really aren't that
> > > much different. I don't really like this solution.
> > >
> > > Leaving out the depth feels like a better suggestion to me.
> > >
> > > But anyway, this seems to only happen at the end of the backing chain.
> > > So if the backing chain consists of n images, why not report 'depth':
> > > n + 1? So, in the above example, you would get 1. I think this has the
> > > best chances of tools actually working correctly with the new output,
> > > even though it's still not unlikely to break something.
> >
> > Ooh, I like that.  It is closer to reality - the file data really
> > comes from the next depth, even if we have no filename at that depth.
> > v2 of my patch coming up.
> 
> How do you know the number of layers? This info is not presented in
> qemu-img map output.

qemu-img map has two output formats.

In --output=human, areas of the disk reading as zero are elided (and
this happens to include ALL areas that were not allocated anywhere in
the chain); all other areas list the filename of the element in the
chain where the data was found.  This mode also fails if compression
or encryption prevents easy access to actual data.  In other words,
it's fragile, so no one uses it for anything programmatic, even though
it's the default.

In --output=json, no file names are output.  Instead, "depth":N tells
you how deep in the backing chain you must traverse to find the data.
"depth":0 is obvious: the file you mapped (other than the bug that
this patch is fixing where we mistakenly used "depth":0 also for
unallocated regions).  If you use "backing":null to force a 1-layer
depth, then "depth":1 is unambiguous meaning the (non-present) backing
file.

Otherwise, you do have a point: "depth":1 in isolation is ambiguous
between "not allocated anywhere in this 1-element chain" and
"allocated at the first backing file in this chain of length 2 or
more".  At which point you can indeed use "qemu-img info" to determine
the backing chain depth.  How painful is that extra step?  Does it
justify the addition of a new optional "backing":true to any portion
of the file that was beyond the end of the chain (and omit that line
for all other regions, rather than printing "backing":false)?

> 
> Users will have to run "qemu-img info --backing-chain" to understand the
> output of qemu-img map.

At any rate, it should be easy enough to output an additional field,
followup patch coming soon...
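
To make that disambiguation concrete, here is a consumer-side sketch (hand-written, not from the thread) of reading qemu-img map --output=json together with a chain length obtained separately (e.g. from qemu-img info --backing-chain), under one reading of Kevin's depth = n + 1 proposal: depths 0..n-1 name images in the chain, and anything past that means "not allocated anywhere". The JSON below is fabricated to match the reproducer in the commit message, not captured output:

```python
import json

def classify(extents, chain_len):
    """Say where each extent's data comes from, given the chain length
    (obtained separately, e.g. from qemu-img info --backing-chain)."""
    result = []
    for e in extents:
        if e["depth"] == 0:
            where = "top image"
        elif e["depth"] < chain_len:
            where = "backing file at depth %d" % e["depth"]
        else:
            # One past the deepest image: not allocated anywhere.
            where = "unallocated in the whole chain"
        result.append((e["start"], e["length"], where))
    return result

# Hand-written JSON matching the commit-message reproducer, with the
# proposed convention applied to the final extent (depth 2 for a
# two-image chain):
extents = json.loads("""
[{"start": 0,      "length": 65536, "depth": 1, "zero": false, "data": true},
 {"start": 65536,  "length": 65536, "depth": 0, "zero": false, "data": true},
 {"start": 131072, "length": 65536, "depth": 0, "zero": true,  "data": false},
 {"start": 196608, "length": 65536, "depth": 2, "zero": true,  "data": false}]
""")

for start, length, where in classify(extents, chain_len=2):
    print(start, length, where)
```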
Nir Soffer June 11, 2021, 9:23 p.m. UTC | #19
On Fri, Jun 11, 2021 at 9:34 PM Eric Blake <eblake@redhat.com> wrote:
>
> On Fri, Jun 11, 2021 at 08:35:01PM +0300, Nir Soffer wrote:
> > On Fri, Jun 11, 2021 at 4:28 PM Eric Blake <eblake@redhat.com> wrote:
> > >
> > > On Fri, Jun 11, 2021 at 10:09:09AM +0200, Kevin Wolf wrote:
> > > > > Yes, that might work as well.  But we didn't previously document
> > > > > depth to be optional.  Removing something from output risks breaking
> > > > > more downstream tools that expect it to be non-optional, compared to
> > > > > providing a new value.
> > > >
> > > > A negative value isn't any less unexpected than a missing key. I don't
> > > > think any existing tool would be able to handle it. Encoding different
> > > > meanings in a single value isn't very QAPI-like either. Usually strings
> > > > that are parsed are the problem, but negative integers really aren't that
> > > > much different. I don't really like this solution.
> > > >
> > > > Leaving out the depth feels like a better suggestion to me.
> > > >
> > > > But anyway, this seems to only happen at the end of the backing chain.
> > > > So if the backing chain consists of n images, why not report 'depth':
> > > > n + 1? So, in the above example, you would get 1. I think this has the
> > > > best chances of tools actually working correctly with the new output,
> > > > even though it's still not unlikely to break something.
> > >
> > > Ooh, I like that.  It is closer to reality - the file data really
> > > comes from the next depth, even if we have no filename at that depth.
> > > v2 of my patch coming up.
> >
> > How do you know the number of layers? This info is not presented in
> > qemu-img map output.
...
> Otherwise, you do have a point: "depth":1 in isolation is ambiguous
> between "not allocated anywhere in this 1-element chain" and
> "allocated at the first backing file in this chain of length 2 or
> more".  At which point you can indeed use "qemu-img info" to determine
> the backing chain depth.  How painful is that extra step?  Does it
> justify the addition of a new optional "backing":true to any portion
> of the file that was beyond the end of the chain (and omit that line
> for all other regions, rather than printing "backing":false)?

Dealing with depth: N + 1 is not that painful, but also not great.

I think it is worth a little more effort, and it will save time in the long term
for users and for developers. Better APIs need simpler and shorter
documentation and require less support.

I'm not sure about the "backing" name; maybe "absent": true, to match libnbd?

Nir
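
Purely for illustration (no such field exists in released qemu at the time of this thread, and neither name is settled): for a single qcow2 image with no backing file whose second cluster is unallocated, combining the depth = n + 1 convention with the suggested "absent" name might look like:

```json
[{ "start": 0,     "length": 65536, "depth": 0, "zero": false, "data": true, "offset": 327680},
 { "start": 65536, "length": 65536, "depth": 1, "zero": true,  "data": false, "absent": true}]
```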
Eric Blake June 11, 2021, 9:41 p.m. UTC | #20
On Sat, Jun 12, 2021 at 12:23:06AM +0300, Nir Soffer wrote:
> > Otherwise, you do have a point: "depth":1 in isolation is ambiguous
> > between "not allocated anywhere in this 1-element chain" and
> > "allocated at the first backing file in this chain of length 2 or
> > more".  At which point you can indeed use "qemu-img info" to determine
> > the backing chain depth.  How painful is that extra step?  Does it
> > justify the addition of a new optional "backing":true to any portion
> > of the file that was beyond the end of the chain (and omit that line
> > for all other regions, rather than printing "backing":false)?
> 
> Dealing with depth: N + 1 is not that painful, but also not great.
> 
> I think it is worth a little more effort, and it will save time in the long term
> for users and for developers. Better APIs need simpler and shorter
> documentation and require less support.
> 
> I'm not sure about backing: false, maybe absent: true to match libnbd?

In the patch [1], I did "backing":true if the cluster was not found in
the chain, and omitted the bool altogether when the cluster is
present.  If we like the name "absent":true better than
"backing":true, that's an easy change.  The libnbd change for nbdinfo
to report 'absent' instead of 'unallocated' has not yet been released,
so we have some leeway on naming choices.

[1] https://lists.gnu.org/archive/html/qemu-devel/2021-06/msg03067.html
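
The flag computation in the nbd/server.c hunk below can be modelled in a few lines. This is a Python sketch for comparing the before/after behaviour; the BDRV_*/NBD_* bit values are stand-ins, not copied from qemu headers:

```python
# Stand-in bit values; illustrative only, not copied from qemu headers.
BDRV_BLOCK_DATA         = 0x01
BDRV_BLOCK_ZERO         = 0x02
BDRV_BLOCK_OFFSET_VALID = 0x04
BDRV_BLOCK_ALLOCATED    = 0x10
NBD_STATE_HOLE = 0x01
NBD_STATE_ZERO = 0x02

def flags_old(ret):
    """Pre-patch: report HOLE whenever the block carries no data."""
    return ((0 if ret & BDRV_BLOCK_DATA else NBD_STATE_HOLE) |
            (NBD_STATE_ZERO if ret & BDRV_BLOCK_ZERO else 0))

def flags_new(ret):
    """Patched: without a valid offset, report HOLE only if the block
    is not allocated anywhere in the chain."""
    flags = NBD_STATE_ZERO if ret & BDRV_BLOCK_ZERO else 0
    if ret & BDRV_BLOCK_OFFSET_VALID:
        if not ret & BDRV_BLOCK_DATA:
            flags |= NBD_STATE_HOLE
    elif not ret & BDRV_BLOCK_ALLOCATED:
        flags |= NBD_STATE_HOLE
    return flags

# A zeroed cluster in the top image: zero, no data, no offset, but
# allocated in this layer.
zeroed = BDRV_BLOCK_ZERO | BDRV_BLOCK_ALLOCATED
print(flags_old(zeroed))  # hole + zero: the reported bug
print(flags_new(zeroed))  # zero without hole: the fix
```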
diff mbox series

Patch

diff --git a/nbd/server.c b/nbd/server.c
index b60ebc3ab6..adf37905d5 100644
--- a/nbd/server.c
+++ b/nbd/server.c
@@ -2127,8 +2127,12 @@  static int blockstatus_to_extents(BlockDriverState *bs, uint64_t offset,
             return ret;
         }
 
-        flags = (ret & BDRV_BLOCK_DATA ? 0 : NBD_STATE_HOLE) |
-                (ret & BDRV_BLOCK_ZERO ? NBD_STATE_ZERO : 0);
+        flags = (ret & BDRV_BLOCK_ZERO ? NBD_STATE_ZERO : 0);
+
+        if (ret & BDRV_BLOCK_OFFSET_VALID)
+            flags |= (ret & BDRV_BLOCK_DATA ? 0 : NBD_STATE_HOLE);
+        else
+            flags |= (ret & BDRV_BLOCK_ALLOCATED ? 0 : NBD_STATE_HOLE);
 
         if (nbd_extent_array_add(ea, num, flags) < 0) {
             return 0;
diff --git a/qemu-img.c b/qemu-img.c
index a5993682aa..6808e12d87 100644
--- a/qemu-img.c
+++ b/qemu-img.c
@@ -3039,7 +3039,9 @@  static int get_block_status(BlockDriverState *bs, int64_t offset,
     *e = (MapEntry) {
         .start = offset,
         .length = bytes,
-        .data = !!(ret & BDRV_BLOCK_DATA),
+        .data = !!(has_offset
+            ? ret & BDRV_BLOCK_DATA
+            : ret & BDRV_BLOCK_ALLOCATED),
         .zero = !!(ret & BDRV_BLOCK_ZERO),
         .offset = map,
         .has_offset = has_offset,