nbd: server: Report holes for raw images

Message ID: 20210219160752.1826830-1-nsoffer@redhat.com
State: New, archived
Series: nbd: server: Report holes for raw images

Commit Message

Nir Soffer Feb. 19, 2021, 4:07 p.m. UTC
When querying image extents for a raw image, qemu-nbd reports holes as
zero:

$ qemu-nbd -t -r -f raw empty-6g.raw

$ qemu-img map --output json nbd://localhost
[{ "start": 0, "length": 6442450944, "depth": 0, "zero": true, "data": true, "offset": 0}]

$ qemu-img map --output json empty-6g.raw
[{ "start": 0, "length": 6442450944, "depth": 0, "zero": true, "data": false, "offset": 0}]

Turns out that qemu-img map reports a hole based on BDRV_BLOCK_DATA, but
the nbd server reports a hole based on BDRV_BLOCK_ALLOCATED.
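
For illustration, here are the two mappings side by side - the server line
matches the pre-patch code in the diff below, while the qemu-img line is a
simplified sketch, not the actual qemu-img source:

    /* nbd/server.c before this patch: a hole is "not allocated" */
    flags = (ret & BDRV_BLOCK_ALLOCATED ? 0 : NBD_STATE_HOLE) |
            (ret & BDRV_BLOCK_ZERO      ? NBD_STATE_ZERO : 0);

    /* qemu-img map: the "data" field follows BDRV_BLOCK_DATA instead */
    bool data = !!(ret & BDRV_BLOCK_DATA);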

The NBD protocol says:

    NBD_STATE_HOLE (bit 0): if set, the block represents a hole (and
    future writes to that area may cause fragmentation or encounter an
    NBD_ENOSPC error); if clear, the block is allocated or the server
    could not otherwise determine its status.

qemu-img manual says:

    whether the sectors contain actual data or not (boolean field data;
    if false, the sectors are either unallocated or stored as
    optimized all-zero clusters);

To me, data=false looks compatible with NBD_STATE_HOLE. From the user's
point of view, getting the same results from qemu-nbd and qemu-img is more
important than being more correct about allocation status.

Changing the nbd server to report holes using BDRV_BLOCK_DATA makes
qemu-nbd results compatible with qemu-img map:

$ qemu-img map --output json nbd://localhost
[{ "start": 0, "length": 6442450944, "depth": 0, "zero": true, "data": false, "offset": 0}]

Signed-off-by: Nir Soffer <nsoffer@redhat.com>
---
 nbd/server.c               | 4 ++--
 tests/qemu-iotests/241.out | 4 ++--
 2 files changed, 4 insertions(+), 4 deletions(-)

Comments

Eric Blake Feb. 19, 2021, 4:42 p.m. UTC | #1
On 2/19/21 10:07 AM, Nir Soffer wrote:
> When querying image extents for a raw image, qemu-nbd reports holes as
> zero:
> 
> $ qemu-nbd -t -r -f raw empty-6g.raw
> 
> $ qemu-img map --output json nbd://localhost
> [{ "start": 0, "length": 6442450944, "depth": 0, "zero": true, "data": true, "offset": 0}]
> 
> $ qemu-img map --output json empty-6g.raw
> [{ "start": 0, "length": 6442450944, "depth": 0, "zero": true, "data": false, "offset": 0}]
> 
> Turns out that qemu-img map reports a hole based on BDRV_BLOCK_DATA, but
> the nbd server reports a hole based on BDRV_BLOCK_ALLOCATED.
> 
> The NBD protocol says:
> 
>     NBD_STATE_HOLE (bit 0): if set, the block represents a hole (and
>     future writes to that area may cause fragmentation or encounter an
>     NBD_ENOSPC error); if clear, the block is allocated or the server
>     could not otherwise determine its status.

Notoriously, lseek(SEEK_HOLE) cannot actually tell us about which
portions of a file are allocated (it can only reliably tell us which
portions of a file read as zero); you have to use the FIEMAP ioctl or
similar to learn about allocation status of a raw file, which is too
expensive to be useful.  The fact that SEEK_HOLE does not tell us about
NBD_STATE_HOLE, but instead is solely about NBD_STATE_ZERO, is therefore
a misnomer that doesn't make it any easier to reason about.
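
As a minimal sketch of that limitation (illustrative code, not anything
from qemu; error handling omitted, and SEEK_HOLE/SEEK_DATA are
Linux-specific extensions):

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
        int fd = open(argv[1], O_RDONLY);
        /* SEEK_HOLE finds the next range that reads as zero; it says
         * nothing about whether those blocks actually occupy space. */
        off_t hole = lseek(fd, 0, SEEK_HOLE);
        off_t data = lseek(fd, hole, SEEK_DATA); /* -1/ENXIO if hole runs to EOF */
        printf("first hole at %lld, next data at %lld\n",
               (long long)hole, (long long)data);
        close(fd);
        return 0;
    }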

But this patch is also an indication that our meaning of
BDRV_BLOCK_ALLOCATED is confused; there has been a mismatch over time on
whether it represents whether something is allocated by occupying space
on disk (which as pointed out above is too expensive to determine for
files, but easy to determine for qcow2) or represents allocated at this
depth within a backing chain (makes no sense for protocol layers, but
total sense for format layers like qcow2).  At one point, Vladimir
audited the code base to try and reserve BDRV_BLOCK_ALLOCATED solely for
the latter meaning (which layer of a backing chain supplies the data,
regardless of whether that extent occupies reserved storage space).

So on that grounds alone, I am inclined to agree with you that using
BDRV_BLOCK_ALLOCATED to feed NBD_STATE_HOLE is wrong: whether something
comes from the current layer of the backing image is unrelated to
whether that extent is known to occupy reserved storage in the file
system.  Failure to set NBD_STATE_HOLE is not fatal (the NBD protocol
specifically chose NBD_STATE_HOLE and NBD_STATE_ZERO to have default
values of 0 in the common case, where setting them to 1 allows some
optimizations, but where skipping the chance to set them to 1 because
determining the answer is too expensive, such as FIEMAP, is fine as well).

> 
> qemu-img manual says:
> 
>     whether the sectors contain actual data or not (boolean field data;
>     if false, the sectors are either unallocated or stored as
>     optimized all-zero clusters);
> 
> To me, data=false looks compatible with NBD_STATE_HOLE. From the user's
> point of view, getting the same results from qemu-nbd and qemu-img is
> more important than being more correct about allocation status.

More to the point, here is our inconsistency:

In nbd/server.c, we turn !BDRV_BLOCK_ALLOCATED into NBD_STATE_HOLE

In block/nbd.c, we turn !NBD_STATE_HOLE into BDRV_BLOCK_DATA

The fact that we are not doing a round-trip conversion means that one of
the two places is wrong.  And your argument that the server side is
wrong makes sense to me.
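
Condensed into code, the broken round trip looks like this (a sketch of
the two conversions, not the literal source):

    /* nbd/server.c (encode): HOLE <- !BDRV_BLOCK_ALLOCATED */
    flags = (ret & BDRV_BLOCK_ALLOCATED ? 0 : NBD_STATE_HOLE);

    /* block/nbd.c (decode): DATA <- !HOLE */
    ret = (flags & NBD_STATE_HOLE ? 0 : BDRV_BLOCK_DATA);

    /* The encoder keys on ALLOCATED but the decoder produces DATA, so the
     * round trip changes the answer whenever ALLOCATED and DATA disagree. */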

> 
> Changing the nbd server to report holes using BDRV_BLOCK_DATA makes
> qemu-nbd results compatible with qemu-img map:
> 
> $ qemu-img map --output json nbd://localhost
> [{ "start": 0, "length": 6442450944, "depth": 0, "zero": true, "data": false, "offset": 0}]
> 
> Signed-off-by: Nir Soffer <nsoffer@redhat.com>
> ---
>  nbd/server.c               | 4 ++--
>  tests/qemu-iotests/241.out | 4 ++--
>  2 files changed, 4 insertions(+), 4 deletions(-)

Conflicts with my patch to make NBD_CMD_READ/NBD_CMD_BLOCK_STATUS obey
block alignment boundaries:
https://lists.gnu.org/archive/html/qemu-devel/2021-02/msg06010.html

Reviewed-by: Eric Blake <eblake@redhat.com>

I'll wait a few days for any other reviewer commentary before taking
this through my NBD tree.
Eric Blake Feb. 19, 2021, 4:58 p.m. UTC | #2
On 2/19/21 10:42 AM, Eric Blake wrote:

>> To me, data=false looks compatible with NBD_STATE_HOLE. From the user's
>> point of view, getting the same results from qemu-nbd and qemu-img is
>> more important than being more correct about allocation status.
> 
> More to the point, here is our inconsistency:
> 
> In nbd/server.c, we turn !BDRV_BLOCK_ALLOCATED into NBD_STATE_HOLE
> 
> In block/nbd.c, we turn !NBD_STATE_HOLE into BDRV_BLOCK_DATA
> 
> The fact that we are not doing a round-trip conversion means that one of
> the two places is wrong.  And your argument that the server side is
> wrong makes sense to me.

In fact, when I went back and researched when this was introduced (see
commit e7b1948d51 in 2018), we may have been aware of the inconsistency
between client and server, but didn't make up our minds at the time:
https://lists.gnu.org/archive/html/qemu-devel/2018-03/msg03465.html
"? Hm, don't remember, what we decided about DATA/HOLE flags mapping.."

> 
> I'll wait a few days for any other reviewer commentary before taking
> this through my NBD tree.
>
Vladimir Sementsov-Ogievskiy Feb. 25, 2021, 6:15 p.m. UTC | #3
19.02.2021 19:07, Nir Soffer wrote:
> When querying image extents for a raw image, qemu-nbd reports holes as
> zero:
> 
> $ qemu-nbd -t -r -f raw empty-6g.raw
> 
> $ qemu-img map --output json nbd://localhost
> [{ "start": 0, "length": 6442450944, "depth": 0, "zero": true, "data": true, "offset": 0}]
> 
> $ qemu-img map --output json empty-6g.raw
> [{ "start": 0, "length": 6442450944, "depth": 0, "zero": true, "data": false, "offset": 0}]
> 
> Turns out that qemu-img map reports a hole based on BDRV_BLOCK_DATA, but
> the nbd server reports a hole based on BDRV_BLOCK_ALLOCATED.
> 
> The NBD protocol says:
> 
>      NBD_STATE_HOLE (bit 0): if set, the block represents a hole (and
>      future writes to that area may cause fragmentation or encounter an
>      NBD_ENOSPC error); if clear, the block is allocated or the server
>      could not otherwise determine its status.

So, it makes sense to report HOLE for lseek holes on raw files (as writes may increase disk usage) and for qcow2 ZERO clusters: again, writing will lead to allocation of a NORMAL cluster.

Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>

> 
> qemu-img manual says:
> 
>      whether the sectors contain actual data or not (boolean field data;
>      if false, the sectors are either unallocated or stored as
>      optimized all-zero clusters);
> 
> To me, data=false looks compatible with NBD_STATE_HOLE. From the user's
> point of view, getting the same results from qemu-nbd and qemu-img is
> more important than being more correct about allocation status.
> 
> Changing the nbd server to report holes using BDRV_BLOCK_DATA makes
> qemu-nbd results compatible with qemu-img map:
> 
> $ qemu-img map --output json nbd://localhost
> [{ "start": 0, "length": 6442450944, "depth": 0, "zero": true, "data": false, "offset": 0}]
> 
> Signed-off-by: Nir Soffer <nsoffer@redhat.com>
> ---
>   nbd/server.c               | 4 ++--
>   tests/qemu-iotests/241.out | 4 ++--
>   2 files changed, 4 insertions(+), 4 deletions(-)
> 
> diff --git a/nbd/server.c b/nbd/server.c
> index 7229f487d2..86a44a9b41 100644
> --- a/nbd/server.c
> +++ b/nbd/server.c
> @@ -2087,8 +2087,8 @@ static int blockstatus_to_extents(BlockDriverState *bs, uint64_t offset,
>               return ret;
>           }
>   
> -        flags = (ret & BDRV_BLOCK_ALLOCATED ? 0 : NBD_STATE_HOLE) |
> -                (ret & BDRV_BLOCK_ZERO      ? NBD_STATE_ZERO : 0);
> +        flags = (ret & BDRV_BLOCK_DATA ? 0 : NBD_STATE_HOLE) |
> +                (ret & BDRV_BLOCK_ZERO ? NBD_STATE_ZERO : 0);
>   
>           if (nbd_extent_array_add(ea, num, flags) < 0) {
>               return 0;
> diff --git a/tests/qemu-iotests/241.out b/tests/qemu-iotests/241.out
> index 75f9f465e5..3f8c173cc8 100644
> --- a/tests/qemu-iotests/241.out
> +++ b/tests/qemu-iotests/241.out
> @@ -5,7 +5,7 @@ QA output created by 241
>     size:  1024
>     min block: 1
>   [{ "start": 0, "length": 1000, "depth": 0, "zero": false, "data": true, "offset": OFFSET},
> -{ "start": 1000, "length": 24, "depth": 0, "zero": true, "data": true, "offset": OFFSET}]
> +{ "start": 1000, "length": 24, "depth": 0, "zero": true, "data": false, "offset": OFFSET}]
>   1 KiB (0x400) bytes     allocated at offset 0 bytes (0x0)
>   
>   === Exporting unaligned raw image, forced server sector alignment ===
> @@ -23,6 +23,6 @@ WARNING: Image format was not specified for 'TEST_DIR/t.raw' and probing guessed
>     size:  1024
>     min block: 1
>   [{ "start": 0, "length": 1000, "depth": 0, "zero": false, "data": true, "offset": OFFSET},
> -{ "start": 1000, "length": 24, "depth": 0, "zero": true, "data": true, "offset": OFFSET}]
> +{ "start": 1000, "length": 24, "depth": 0, "zero": true, "data": false, "offset": OFFSET}]
>   1 KiB (0x400) bytes     allocated at offset 0 bytes (0x0)
>   *** done
>
Vladimir Sementsov-Ogievskiy Feb. 25, 2021, 6:50 p.m. UTC | #4
19.02.2021 19:58, Eric Blake wrote:
> On 2/19/21 10:42 AM, Eric Blake wrote:
> 
>>> To me, data=false looks compatible with NBD_STATE_HOLE. From the user's
>>> point of view, getting the same results from qemu-nbd and qemu-img is
>>> more important than being more correct about allocation status.
>>
>> More to the point, here is our inconsistency:
>>
>> In nbd/server.c, we turn !BDRV_BLOCK_ALLOCATED into NBD_STATE_HOLE
>>
>> In block/nbd.c, we turn !NBD_STATE_HOLE into BDRV_BLOCK_DATA
>>
>> The fact that we are not doing a round-trip conversion means that one of
>> the two places is wrong.  And your argument that the server side is
>> wrong makes sense to me.
> 
> In fact, when I went back and researched when this was introduced (see
> commit e7b1948d51 in 2018), we may have been aware of the inconsistency
> between client and server, but didn't make up our minds at the time:
> https://lists.gnu.org/archive/html/qemu-devel/2018-03/msg03465.html
> "? Hm, don't remember, what we decided about DATA/HOLE flags mapping.."
> 
>>
>> I'll wait a few days for any other reviewer commentary before taking
>> this through my NBD tree.
>>
> 


I can add the following.

First, link to my research of block_status in Qemu: https://lists.gnu.org/archive/html/qemu-devel/2020-04/msg05136.html

And about HOLE and ZERO..

As I've noted in the research above, SCSI may return HOLE & !ZERO:

from SCSI:
Logical Block Provisioning Read Zeros (LBPRZ) bit
1     If the logical block provisioning read zeros (LBPRZ) bit is set to one, then, for an unmapped LBA specified by a read operation, the device server shall send user data with all bits set to zero to the data-in buffer.
0     If the LBPRZ bit is set to zero, then, for an unmapped LBA specified by a read operation, the device server may send user data with all bits set to any value to the data-in buffer.

So we can have an unmapped area that may read as random data. The same can be said about the null-co driver with read-zeroes=false.

Also, qcow2 supports ALLOCATED ZERO clusters, which read as zero but whose data is allocated - it is reasonable to report them as ZERO & !HOLE.

And of course UNALLOCATED ZERO clusters in qcow2 and lseek holes are reasonable to report as ZERO & HOLE, because they read as zero and "future writes to that area may cause fragmentation or encounter an NBD_ENOSPC"..

So, all combinations are reasonable; we just need to fix the Qemu NBD server to report correct statuses in all these cases.

It seems that the ZERO/HOLE specification is a lot more reasonable than what we have with ZERO/DATA/ALLOCATED in Qemu, and maybe the right way is to move the internal block_status to use NBD terms.
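
As an illustrative summary of those four cases (the enum names are
hypothetical, not qemu definitions):

    enum {
        EXT_NORMAL_DATA    = 0,                               /* allocated data             */
        EXT_UNMAPPED_JUNK  = NBD_STATE_HOLE,                  /* SCSI unmapped with LBPRZ=0,
                                                               * null-co read-zeroes=false  */
        EXT_ALLOCATED_ZERO = NBD_STATE_ZERO,                  /* qcow2 allocated ZERO       */
        EXT_SPARSE_ZERO    = NBD_STATE_HOLE | NBD_STATE_ZERO, /* lseek hole, qcow2
                                                               * unallocated ZERO           */
    };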


And thanks for CCing me. Hmm, maybe, I'll suggest myself as co-maintainer for NBD?
Nir Soffer March 3, 2021, 9:45 p.m. UTC | #5
On Thu, Feb 25, 2021 at 8:51 PM Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com> wrote:

> 19.02.2021 19:58, Eric Blake wrote:
> > On 2/19/21 10:42 AM, Eric Blake wrote:
> >
> >>> To me, data=false looks compatible with NBD_STATE_HOLE. From the user's
> >>> point of view, getting the same results from qemu-nbd and qemu-img is
> >>> more important than being more correct about allocation status.
> >>
> >> More to the point, here is our inconsistency:
> >>
> >> In nbd/server.c, we turn !BDRV_BLOCK_ALLOCATED into NBD_STATE_HOLE
> >>
> >> In block/nbd.c, we turn !NBD_STATE_HOLE into BDRV_BLOCK_DATA
> >>
> >> The fact that we are not doing a round-trip conversion means that one of
> >> the two places is wrong.  And your argument that the server side is
> >> wrong makes sense to me.
> >
> > In fact, when I went back and researched when this was introduced (see
> > commit e7b1948d51 in 2018), we may have been aware of the inconsistency
> > between client and server, but didn't make up our minds at the time:
> > https://lists.gnu.org/archive/html/qemu-devel/2018-03/msg03465.html
> > "? Hm, don't remember, what we decided about DATA/HOLE flags mapping.."
> >
> >>
> >> I'll wait a few days for any other reviewer commentary before taking
> >> this through my NBD tree.
> >>
> >
>
>
> I can add the following.
>
> First, link to my research of block_status in Qemu:
> https://lists.gnu.org/archive/html/qemu-devel/2020-04/msg05136.html
>
> And about HOLE and ZERO..
>
> As I've noted in the research above, SCSI may return HOLE & !ZERO:
>
> from SCSI:
> Logical Block Provisioning Read Zeros (LBPRZ) bit
> 1     If the logical block provisioning read zeros (LBPRZ) bit is set to
> one, then, for an unmapped LBA specified by a read operation, the
> device server shall send user data with all bits set to zero to the data-in
> buffer.
> 0     If the LBPRZ bit is set to zero, then, for an unmapped LBA specified
> by a read operation, the device server may send user data with all bits set
> to any value to the data-in buffer.
>
> So we can have an unmapped area that may read as random data. The same
> can be said about the null-co driver with read-zeroes=false.
>
> Also, qcow2 supports ALLOCATED ZERO clusters, which read as zero but whose
> data is allocated - it is reasonable to report them as ZERO & !HOLE.
>
> And of course UNALLOCATED ZERO clusters in qcow2 and lseek holes are
> reasonable to report as ZERO & HOLE, because they read as zero and
> "future writes to that area may cause fragmentation or encounter an
> NBD_ENOSPC"..
>
> So, all combinations are reasonable; we just need to fix the Qemu NBD
> server to report correct statuses in all these cases.
>
> It seems that the ZERO/HOLE specification is a lot more reasonable than
> what we have with ZERO/DATA/ALLOCATED in Qemu, and maybe the right way is
> to move the internal block_status to use NBD terms.
>
>
> And thanks for CCing me. Hmm, maybe, I'll suggest myself as co-maintainer
> for NBD?


Kevin, Max, are you ok with this change?
Eric Blake March 3, 2021, 9:51 p.m. UTC | #6
On 3/3/21 3:45 PM, Nir Soffer wrote:

>>>>
>>>> I'll wait a few days for any other reviewer commentary before taking
>>>> this through my NBD tree.
>>>>

>>
>> And thanks for CCing me. Hmm, maybe, I'll suggest myself as co-maintainer
>> for NBD?

Vladimir, I'd be happy if you want to do that in a separate patch
(you're already a co-maintainer on block bitmaps, which are somewhat
related).

> 
> 
> Kevin, Max, are you ok with this change?

I guess that means I should send my NBD pull request sooner rather than
later, since it's been a few days with no other comments?
Kevin Wolf March 4, 2021, 12:22 p.m. UTC | #7
On 25.02.2021 19:50, Vladimir Sementsov-Ogievskiy wrote:
> 19.02.2021 19:58, Eric Blake wrote:
> > On 2/19/21 10:42 AM, Eric Blake wrote:
> > 
> > > > To me, data=false looks compatible with NBD_STATE_HOLE. From the
> > > > user's point of view, getting the same results from qemu-nbd and
> > > > qemu-img is more important than being more correct about allocation
> > > > status.
> > > 
> > > More to the point, here is our inconsistency:
> > > 
> > > In nbd/server.c, we turn !BDRV_BLOCK_ALLOCATED into NBD_STATE_HOLE
> > > 
> > > In block/nbd.c, we turn !NBD_STATE_HOLE into BDRV_BLOCK_DATA
> > > 
> > > The fact that we are not doing a round-trip conversion means that one of
> > > the two places is wrong.  And your argument that the server side is
> > > wrong makes sense to me.
> > 
> > In fact, when I went back and researched when this was introduced (see
> > commit e7b1948d51 in 2018), we may have been aware of the inconsistency
> > between client and server, but didn't make up our minds at the time:
> > https://lists.gnu.org/archive/html/qemu-devel/2018-03/msg03465.html
> > "? Hm, don't remember, what we decided about DATA/HOLE flags mapping.."
> > 
> > > 
> > > I'll wait a few days for any other reviewer commentary before taking
> > > this through my NBD tree.
> > > 
> > 
> 
> 
> I can add the following.
> 
> First, link to my research of block_status in Qemu:
> https://lists.gnu.org/archive/html/qemu-devel/2020-04/msg05136.html
> 
> And about HOLE and ZERO..
> 
> As I've noted in the research above, SCSI may return HOLE & !ZERO:
> 
> from SCSI: Logical Block Provisioning Read Zeros (LBPRZ) bit
> 1     If the logical block provisioning read zeros (LBPRZ) bit is set to
> one, then, for an unmapped LBA specified by a read operation, the
> device server shall send user data with all bits set to zero to the
> data-in buffer.
> 0     If the LBPRZ bit is set to zero, then, for an unmapped LBA
> specified by a read operation, the device server may send user data with
> all bits set to any value to the data-in buffer.
>
> So we can have an unmapped area that may read as random data.
> The same can be said about the null-co driver with read-zeroes=false.
>
> Also, qcow2 supports ALLOCATED ZERO clusters, which read as zero but
> whose data is allocated - it is reasonable to report them as ZERO & !HOLE.
>
> And of course UNALLOCATED ZERO clusters in qcow2 and lseek holes are
> reasonable to report as ZERO & HOLE, because they read as zero and
> "future writes to that area may cause fragmentation or encounter an
> NBD_ENOSPC"..
>
> So, all combinations are reasonable; we just need to fix the Qemu NBD
> server to report correct statuses in all these cases.
>
> It seems that the ZERO/HOLE specification is a lot more reasonable than
> what we have with ZERO/DATA/ALLOCATED in Qemu, and maybe the right way
> is to move the internal block_status to use NBD terms.

Is there not a 1:1 correspondence between our internal flags and the NBD
ones? ZERO is exactly the same, and HOLE is the inversion of DATA.

ALLOCATED is important internally when finding the node in a backing
file chain that actually defines the content, but for a user it doesn't
make a difference. This is why it isn't exposed in NBD.

So I think both QEMU and NBD use the flags that make sense in the
respective context.
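
If so, the two directions reduce to a symmetric pair, sketched here with
hypothetical helper names (not the actual qemu functions):

    static uint32_t extent_to_nbd(int ret)
    {
        return (ret & BDRV_BLOCK_DATA ? 0 : NBD_STATE_HOLE) |
               (ret & BDRV_BLOCK_ZERO ? NBD_STATE_ZERO : 0);
    }

    static int nbd_to_extent(uint32_t flags)
    {
        return (flags & NBD_STATE_HOLE ? 0 : BDRV_BLOCK_DATA) |
               (flags & NBD_STATE_ZERO ? BDRV_BLOCK_ZERO : 0);
    }

    /* With this patch the mappings round-trip: HOLE == !DATA, ZERO == ZERO,
     * and ALLOCATED stays internal to the block layer. */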

Kevin

Patch

diff --git a/nbd/server.c b/nbd/server.c
index 7229f487d2..86a44a9b41 100644
--- a/nbd/server.c
+++ b/nbd/server.c
@@ -2087,8 +2087,8 @@  static int blockstatus_to_extents(BlockDriverState *bs, uint64_t offset,
             return ret;
         }
 
-        flags = (ret & BDRV_BLOCK_ALLOCATED ? 0 : NBD_STATE_HOLE) |
-                (ret & BDRV_BLOCK_ZERO      ? NBD_STATE_ZERO : 0);
+        flags = (ret & BDRV_BLOCK_DATA ? 0 : NBD_STATE_HOLE) |
+                (ret & BDRV_BLOCK_ZERO ? NBD_STATE_ZERO : 0);
 
         if (nbd_extent_array_add(ea, num, flags) < 0) {
             return 0;
diff --git a/tests/qemu-iotests/241.out b/tests/qemu-iotests/241.out
index 75f9f465e5..3f8c173cc8 100644
--- a/tests/qemu-iotests/241.out
+++ b/tests/qemu-iotests/241.out
@@ -5,7 +5,7 @@  QA output created by 241
   size:  1024
   min block: 1
 [{ "start": 0, "length": 1000, "depth": 0, "zero": false, "data": true, "offset": OFFSET},
-{ "start": 1000, "length": 24, "depth": 0, "zero": true, "data": true, "offset": OFFSET}]
+{ "start": 1000, "length": 24, "depth": 0, "zero": true, "data": false, "offset": OFFSET}]
 1 KiB (0x400) bytes     allocated at offset 0 bytes (0x0)
 
 === Exporting unaligned raw image, forced server sector alignment ===
@@ -23,6 +23,6 @@  WARNING: Image format was not specified for 'TEST_DIR/t.raw' and probing guessed
   size:  1024
   min block: 1
 [{ "start": 0, "length": 1000, "depth": 0, "zero": false, "data": true, "offset": OFFSET},
-{ "start": 1000, "length": 24, "depth": 0, "zero": true, "data": true, "offset": OFFSET}]
+{ "start": 1000, "length": 24, "depth": 0, "zero": true, "data": false, "offset": OFFSET}]
 1 KiB (0x400) bytes     allocated at offset 0 bytes (0x0)
 *** done