Message ID | 20210219160752.1826830-1-nsoffer@redhat.com (mailing list archive)
---|---
State | New, archived
Series | nbd: server: Report holes for raw images
On 2/19/21 10:07 AM, Nir Soffer wrote:
> When querying image extents for raw image, qemu-nbd reports holes as
> zero:
>
> $ qemu-nbd -t -r -f raw empty-6g.raw
>
> $ qemu-img map --output json nbd://localhost
> [{ "start": 0, "length": 6442450944, "depth": 0, "zero": true, "data": true, "offset": 0}]
>
> $ qemu-img map --output json empty-6g.raw
> [{ "start": 0, "length": 6442450944, "depth": 0, "zero": true, "data": false, "offset": 0}]
>
> Turns out that qemu-img map reports a hole based on BDRV_BLOCK_DATA, but
> nbd server reports a hole based on BDRV_BLOCK_ALLOCATED.
>
> The NBD protocol says:
>
>   NBD_STATE_HOLE (bit 0): if set, the block represents a hole (and
>   future writes to that area may cause fragmentation or encounter an
>   NBD_ENOSPC error); if clear, the block is allocated or the server
>   could not otherwise determine its status.

Notoriously, lseek(SEEK_HOLE) cannot actually tell us which portions of a
file are allocated (it can only reliably tell us which portions of a file
read as zero); you have to use the FIEMAP ioctl or similar to learn the
allocation status of a raw file, which is too expensive to be useful. The
fact that SEEK_HOLE does not tell us about NBD_STATE_HOLE, but is instead
solely about NBD_STATE_ZERO, is therefore a misnomer that doesn't make it
any easier to reason about.

But this patch is also an indication that our meaning of
BDRV_BLOCK_ALLOCATED is confused; there has been a mismatch over time on
whether it represents allocation in the sense of occupying space on disk
(which, as pointed out above, is too expensive to determine for files but
easy to determine for qcow2), or allocation at this depth within a backing
chain (which makes no sense for protocol layers, but total sense for
format layers like qcow2).
At one point, Vladimir audited the code base to try and reserve
BDRV_BLOCK_ALLOCATED solely for the latter meaning (which layer of a
backing chain supplies the data, regardless of whether that extent
occupies reserved storage space). So on those grounds alone, I am
inclined to agree with you that using BDRV_BLOCK_ALLOCATED to feed
NBD_STATE_HOLE is wrong: whether something comes from the current layer
of the backing image is unrelated to whether that extent is known to
occupy reserved storage in the file system.

Failure to set NBD_STATE_HOLE is not fatal (the NBD protocol
specifically chose NBD_STATE_HOLE and NBD_STATE_ZERO to have default
values of 0 in the common case, where setting them to 1 allows some
optimizations, but where skipping the chance to set them to 1 because
determining the answer is too expensive, as with FIEMAP, is fine as
well).

> qemu-img manual says:
>
>   whether the sectors contain actual data or not (boolean field data;
>   if false, the sectors are either unallocated or stored as
>   optimized all-zero clusters);
>
> To me, data=false looks compatible with NBD_STATE_HOLE. From user point
> of view, getting same results from qemu-nbd and qemu-img is more
> important than being more correct about allocation status.

More to the point, here is our inconsistency:

In nbd/server.c, we turn !BDRV_BLOCK_ALLOCATED into NBD_STATE_HOLE.

In block/nbd.c, we turn !NBD_STATE_HOLE into BDRV_BLOCK_DATA.

The fact that we are not doing a round-trip conversion means that one of
the two places is wrong. And your argument that the server side is wrong
makes sense to me.
> Changing nbd server to report holes using BDRV_BLOCK_DATA makes qemu-nbd
> results compatible with qemu-img map:
>
> $ qemu-img map --output json nbd://localhost
> [{ "start": 0, "length": 6442450944, "depth": 0, "zero": true, "data": false, "offset": 0}]
>
> Signed-off-by: Nir Soffer <nsoffer@redhat.com>
> ---
>  nbd/server.c               | 4 ++--
>  tests/qemu-iotests/241.out | 4 ++--
>  2 files changed, 4 insertions(+), 4 deletions(-)

Conflicts with my patch to make NBD_CMD_READ/NBD_CMD_BLOCK_STATUS obey
block alignment boundaries:
https://lists.gnu.org/archive/html/qemu-devel/2021-02/msg06010.html

Reviewed-by: Eric Blake <eblake@redhat.com>

I'll wait a few days for any other reviewer commentary before taking
this through my NBD tree.
On 2/19/21 10:42 AM, Eric Blake wrote:
>> To me, data=false looks compatible with NBD_STATE_HOLE. From user point
>> of view, getting same results from qemu-nbd and qemu-img is more
>> important than being more correct about allocation status.
>
> More to the point, here is our inconsistency:
>
> In nbd/server.c, we turn !BDRV_BLOCK_ALLOCATED into NBD_STATE_HOLE
>
> In block/nbd.c, we turn !NBD_STATE_HOLE into BDRV_BLOCK_DATA
>
> The fact that we are not doing a round-trip conversion means that one of
> the two places is wrong. And your argument that the server side is
> wrong makes sense to me.

In fact, when I went back and researched when this was introduced (see
commit e7b1948d51 in 2018), we may have been aware of the inconsistency
between client and server, but didn't make up our minds at the time:
https://lists.gnu.org/archive/html/qemu-devel/2018-03/msg03465.html
"? Hm, don't remember, what we decided about DATA/HOLE flags mapping.."

> I'll wait a few days for any other reviewer commentary before taking
> this through my NBD tree.
19.02.2021 19:07, Nir Soffer wrote:
> When querying image extents for raw image, qemu-nbd reports holes as
> zero:
>
> $ qemu-nbd -t -r -f raw empty-6g.raw
>
> $ qemu-img map --output json nbd://localhost
> [{ "start": 0, "length": 6442450944, "depth": 0, "zero": true, "data": true, "offset": 0}]
>
> $ qemu-img map --output json empty-6g.raw
> [{ "start": 0, "length": 6442450944, "depth": 0, "zero": true, "data": false, "offset": 0}]
>
> Turns out that qemu-img map reports a hole based on BDRV_BLOCK_DATA, but
> nbd server reports a hole based on BDRV_BLOCK_ALLOCATED.
>
> The NBD protocol says:
>
>   NBD_STATE_HOLE (bit 0): if set, the block represents a hole (and
>   future writes to that area may cause fragmentation or encounter an
>   NBD_ENOSPC error); if clear, the block is allocated or the server
>   could not otherwise determine its status.

So, it makes sense to report HOLE for lseek holes on raw files (as a
write may increase disk usage) and for qcow2 ZERO clusters: again,
writing will lead to allocation of a NORMAL cluster.

Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>

> qemu-img manual says:
>
>   whether the sectors contain actual data or not (boolean field data;
>   if false, the sectors are either unallocated or stored as
>   optimized all-zero clusters);
>
> To me, data=false looks compatible with NBD_STATE_HOLE. From user point
> of view, getting same results from qemu-nbd and qemu-img is more
> important than being more correct about allocation status.
> Changing nbd server to report holes using BDRV_BLOCK_DATA makes qemu-nbd
> results compatible with qemu-img map:
>
> $ qemu-img map --output json nbd://localhost
> [{ "start": 0, "length": 6442450944, "depth": 0, "zero": true, "data": false, "offset": 0}]
>
> Signed-off-by: Nir Soffer <nsoffer@redhat.com>
> ---
>  nbd/server.c               | 4 ++--
>  tests/qemu-iotests/241.out | 4 ++--
>  2 files changed, 4 insertions(+), 4 deletions(-)
>
> diff --git a/nbd/server.c b/nbd/server.c
> index 7229f487d2..86a44a9b41 100644
> --- a/nbd/server.c
> +++ b/nbd/server.c
> @@ -2087,8 +2087,8 @@ static int blockstatus_to_extents(BlockDriverState *bs, uint64_t offset,
>          return ret;
>      }
>
> -    flags = (ret & BDRV_BLOCK_ALLOCATED ? 0 : NBD_STATE_HOLE) |
> -            (ret & BDRV_BLOCK_ZERO ? NBD_STATE_ZERO : 0);
> +    flags = (ret & BDRV_BLOCK_DATA ? 0 : NBD_STATE_HOLE) |
> +            (ret & BDRV_BLOCK_ZERO ? NBD_STATE_ZERO : 0);
>
>      if (nbd_extent_array_add(ea, num, flags) < 0) {
>          return 0;
> diff --git a/tests/qemu-iotests/241.out b/tests/qemu-iotests/241.out
> index 75f9f465e5..3f8c173cc8 100644
> --- a/tests/qemu-iotests/241.out
> +++ b/tests/qemu-iotests/241.out
> @@ -5,7 +5,7 @@ QA output created by 241
>  size: 1024
>  min block: 1
>  [{ "start": 0, "length": 1000, "depth": 0, "zero": false, "data": true, "offset": OFFSET},
> -{ "start": 1000, "length": 24, "depth": 0, "zero": true, "data": true, "offset": OFFSET}]
> +{ "start": 1000, "length": 24, "depth": 0, "zero": true, "data": false, "offset": OFFSET}]
>  1 KiB (0x400) bytes allocated at offset 0 bytes (0x0)
>
>  === Exporting unaligned raw image, forced server sector alignment ===
> @@ -23,6 +23,6 @@ WARNING: Image format was not specified for 'TEST_DIR/t.raw' and probing guessed
>  size: 1024
>  min block: 1
>  [{ "start": 0, "length": 1000, "depth": 0, "zero": false, "data": true, "offset": OFFSET},
> -{ "start": 1000, "length": 24, "depth": 0, "zero": true, "data": true, "offset": OFFSET}]
> +{ "start": 1000, "length": 24, "depth": 0, "zero": true, "data": false, "offset": OFFSET}]
>  1 KiB (0x400) bytes allocated at offset 0 bytes (0x0)
>  *** done
19.02.2021 19:58, Eric Blake wrote:
> On 2/19/21 10:42 AM, Eric Blake wrote:
>
>>> To me, data=false looks compatible with NBD_STATE_HOLE. From user point
>>> of view, getting same results from qemu-nbd and qemu-img is more
>>> important than being more correct about allocation status.
>>
>> More to the point, here is our inconsistency:
>>
>> In nbd/server.c, we turn !BDRV_BLOCK_ALLOCATED into NBD_STATE_HOLE
>>
>> In block/nbd.c, we turn !NBD_STATE_HOLE into BDRV_BLOCK_DATA
>>
>> The fact that we are not doing a round-trip conversion means that one of
>> the two places is wrong. And your argument that the server side is
>> wrong makes sense to me.
>
> In fact, when I went back and researched when this was introduced (see
> commit e7b1948d51 in 2018), we may have been aware of the inconsistency
> between client and server, but didn't make up our minds at the time:
> https://lists.gnu.org/archive/html/qemu-devel/2018-03/msg03465.html
> "? Hm, don't remember, what we decided about DATA/HOLE flags mapping.."
>
>> I'll wait a few days for any other reviewer commentary before taking
>> this through my NBD tree.

I can add the following.

First, a link to my research of block_status in QEMU:
https://lists.gnu.org/archive/html/qemu-devel/2020-04/msg05136.html

And about HOLE and ZERO: as I've noted in the research above, SCSI may
return HOLE & !ZERO. From SCSI:

  Logical Block Provisioning Read Zeros (LBPRZ) bit
  1  If the logical block provisioning read zeros (LBPRZ) bit is set to
     one, then, for an unmapped LBA specified by a read operation, the
     device server shall send user data with all bits set to zero to the
     data-in buffer.
  0  If the LBPRZ bit is set to zero, then, for an unmapped LBA specified
     by a read operation, the device server may send user data with all
     bits set to any value to the data-in buffer.

So we can have an unmapped area that can be read as any random data. The
same thing can be said about the null-co driver with read-zeroes=false.

Also, qcow2 supports ALLOCATED ZERO clusters, which read as zero but have
data allocated; these are reasonable to report as ZERO & !HOLE.

And of course UNALLOCATED ZERO clusters in qcow2 and lseek holes are
reasonable to report as ZERO & HOLE, because they read as zero and
"future writes to that area may cause fragmentation or encounter an
NBD_ENOSPC" error.

So, all combinations are reasonable; we just need to fix the QEMU NBD
server to report correct statuses in all these cases.

It seems the ZERO/HOLE specification is a lot more reasonable than what
we have with ZERO/DATA/ALLOCATED in QEMU, and maybe the true way is to
move the internal block_status to use NBD terms.

And thanks for CCing me. Hmm, maybe I'll suggest myself as co-maintainer
for NBD?
On Thu, Feb 25, 2021 at 8:51 PM Vladimir Sementsov-Ogievskiy
<vsementsov@virtuozzo.com> wrote:
> 19.02.2021 19:58, Eric Blake wrote:
> > On 2/19/21 10:42 AM, Eric Blake wrote:
> >
> >>> To me, data=false looks compatible with NBD_STATE_HOLE. From user point
> >>> of view, getting same results from qemu-nbd and qemu-img is more
> >>> important than being more correct about allocation status.
> >>
> >> More to the point, here is our inconsistency:
> >>
> >> In nbd/server.c, we turn !BDRV_BLOCK_ALLOCATED into NBD_STATE_HOLE
> >>
> >> In block/nbd.c, we turn !NBD_STATE_HOLE into BDRV_BLOCK_DATA
> >>
> >> The fact that we are not doing a round-trip conversion means that one of
> >> the two places is wrong. And your argument that the server side is
> >> wrong makes sense to me.
> >
> > In fact, when I went back and researched when this was introduced (see
> > commit e7b1948d51 in 2018), we may have been aware of the inconsistency
> > between client and server, but didn't make up our minds at the time:
> > https://lists.gnu.org/archive/html/qemu-devel/2018-03/msg03465.html
> > "? Hm, don't remember, what we decided about DATA/HOLE flags mapping.."
> >
> >> I'll wait a few days for any other reviewer commentary before taking
> >> this through my NBD tree.
>
> I can add the following.
>
> First, link to my research of block_status in Qemu:
> https://lists.gnu.org/archive/html/qemu-devel/2020-04/msg05136.html
>
> And about HOLE and ZERO..
>
> As I've noted in the research above, SCSI may return HOLE & !ZERO:
>
> from SCSI:
> Logical Block Provisioning Read Zeros (LBPRZ) bit
> 1 If the logical block provisioning read zeros (LBPRZ) bit is set to
> one, then, for an unmapped LBA specified by a read operation, the
> device server shall send user data with all bits set to zero to the
> data-in buffer.
> 0 If the LBPRZ bit is set to zero, then, for an unmapped LBA specified
> by a read operation, the device server may send user data with all
> bits set to any value to the data-in buffer.
>
> So we can have an unmapped area that can be read as any random data.
> Same thing can be said about null-co driver with read-zeroes=false
>
> Also, qcow2 support ALLOCATED ZERO clusters which reads as zero but
> data is allocated - they are reasonable to report as ZERO & !HOLE
>
> And of-course UNALLOCATED ZERO clusters in qcow2 and lseek-holes are
> reasonable to report as ZERO & HOLE, because they reads as zero and
> "future writes to that area may cause fragmentation or encounter an
> NBD_ENOSPC"..
>
> So, all combination are reasonable, we just need to fix Qemu NBD
> server to report correct statuses in all these cases.
>
> It seems that ZERO/HOLE specification is a lot more reasonable than
> what we have with ZERO/DATA/ALLOCATED in Qemu, and may be true way is
> move internal block_status to use NBD terms.
>
> And thanks for CCing me. Hmm, maybe, I'll suggest myself as co-maintainer
> for NBD?

Kevin, Max, are you ok with this change?
On 3/3/21 3:45 PM, Nir Soffer wrote:
>>>> I'll wait a few days for any other reviewer commentary before taking
>>>> this through my NBD tree.
>>
>> And thanks for CCing me. Hmm, maybe, I'll suggest myself as co-maintainer
>> for NBD?

Vladimir, I'd be happy if you want to do that in a separate patch (you're
already a co-maintainer on block bitmaps, which are somewhat related).

> Kevin, Max, are you ok with this change?

I guess that means I should send my NBD pull request sooner rather than
later, since it's been a few days with no other comments?
Am 25.02.2021 um 19:50 hat Vladimir Sementsov-Ogievskiy geschrieben:
> 19.02.2021 19:58, Eric Blake wrote:
> > On 2/19/21 10:42 AM, Eric Blake wrote:
> >
> > > > To me, data=false looks compatible with NBD_STATE_HOLE. From user point
> > > > of view, getting same results from qemu-nbd and qemu-img is more
> > > > important than being more correct about allocation status.
> > >
> > > More to the point, here is our inconsistency:
> > >
> > > In nbd/server.c, we turn !BDRV_BLOCK_ALLOCATED into NBD_STATE_HOLE
> > >
> > > In block/nbd.c, we turn !NBD_STATE_HOLE into BDRV_BLOCK_DATA
> > >
> > > The fact that we are not doing a round-trip conversion means that one of
> > > the two places is wrong. And your argument that the server side is
> > > wrong makes sense to me.
> >
> > In fact, when I went back and researched when this was introduced (see
> > commit e7b1948d51 in 2018), we may have been aware of the inconsistency
> > between client and server, but didn't make up our minds at the time:
> > https://lists.gnu.org/archive/html/qemu-devel/2018-03/msg03465.html
> > "? Hm, don't remember, what we decided about DATA/HOLE flags mapping.."
> >
> > > I'll wait a few days for any other reviewer commentary before taking
> > > this through my NBD tree.
>
> I can add the following.
>
> First, link to my research of block_status in Qemu:
> https://lists.gnu.org/archive/html/qemu-devel/2020-04/msg05136.html
>
> And about HOLE and ZERO..
>
> As I've noted in the research above, SCSI may return HOLE & !ZERO:
>
> from SCSI: Logical Block Provisioning Read Zeros (LBPRZ) bit
> 1 If the logical block provisioning read zeros (LBPRZ) bit is set to one,
> then, for an unmapped LBA specified by a read operation, the
> device server shall send user data with all bits set to zero to the
> data-in buffer.
> 0 If the LBPRZ bit is set to zero, then, for an unmapped LBA specified
> by a read operation, the device server may send user data with all
> bits set to any value to the data-in buffer.
>
> So we can have an unmapped area that can be read as any random data.
> Same thing can be said about null-co driver with read-zeroes=false
>
> Also, qcow2 support ALLOCATED ZERO clusters which reads as zero but
> data is allocated - they are reasonable to report as ZERO & !HOLE
>
> And of-course UNALLOCATED ZERO clusters in qcow2 and lseek-holes are
> reasonable to report as ZERO & HOLE, because they reads as zero and
> "future writes to that area may cause fragmentation or encounter an
> NBD_ENOSPC"..
>
> So, all combination are reasonable, we just need to fix Qemu NBD
> server to report correct statuses in all these cases.
>
> It seems that ZERO/HOLE specification is a lot more reasonable than
> what we have with ZERO/DATA/ALLOCATED in Qemu, and may be true way is
> move internal block_status to use NBD terms.

Is there not a 1:1 correspondence between our internal flags and the NBD
ones? ZERO is exactly the same, and HOLE is the inversion of DATA.

ALLOCATED is important internally when finding the node in a backing
file chain that actually defines the content, but for a user it doesn't
make a difference. This is why it isn't exposed in NBD.

So I think both QEMU and NBD use the flags that make sense in the
respective context.

Kevin
diff --git a/nbd/server.c b/nbd/server.c
index 7229f487d2..86a44a9b41 100644
--- a/nbd/server.c
+++ b/nbd/server.c
@@ -2087,8 +2087,8 @@ static int blockstatus_to_extents(BlockDriverState *bs, uint64_t offset,
         return ret;
     }
 
-    flags = (ret & BDRV_BLOCK_ALLOCATED ? 0 : NBD_STATE_HOLE) |
-            (ret & BDRV_BLOCK_ZERO ? NBD_STATE_ZERO : 0);
+    flags = (ret & BDRV_BLOCK_DATA ? 0 : NBD_STATE_HOLE) |
+            (ret & BDRV_BLOCK_ZERO ? NBD_STATE_ZERO : 0);
 
     if (nbd_extent_array_add(ea, num, flags) < 0) {
         return 0;
diff --git a/tests/qemu-iotests/241.out b/tests/qemu-iotests/241.out
index 75f9f465e5..3f8c173cc8 100644
--- a/tests/qemu-iotests/241.out
+++ b/tests/qemu-iotests/241.out
@@ -5,7 +5,7 @@ QA output created by 241
 size: 1024
 min block: 1
 [{ "start": 0, "length": 1000, "depth": 0, "zero": false, "data": true, "offset": OFFSET},
-{ "start": 1000, "length": 24, "depth": 0, "zero": true, "data": true, "offset": OFFSET}]
+{ "start": 1000, "length": 24, "depth": 0, "zero": true, "data": false, "offset": OFFSET}]
 1 KiB (0x400) bytes allocated at offset 0 bytes (0x0)
 
 === Exporting unaligned raw image, forced server sector alignment ===
 
@@ -23,6 +23,6 @@ WARNING: Image format was not specified for 'TEST_DIR/t.raw' and probing guessed
 size: 1024
 min block: 1
 [{ "start": 0, "length": 1000, "depth": 0, "zero": false, "data": true, "offset": OFFSET},
-{ "start": 1000, "length": 24, "depth": 0, "zero": true, "data": true, "offset": OFFSET}]
+{ "start": 1000, "length": 24, "depth": 0, "zero": true, "data": false, "offset": OFFSET}]
 1 KiB (0x400) bytes allocated at offset 0 bytes (0x0)
 *** done
When querying image extents for raw image, qemu-nbd reports holes as
zero:

$ qemu-nbd -t -r -f raw empty-6g.raw

$ qemu-img map --output json nbd://localhost
[{ "start": 0, "length": 6442450944, "depth": 0, "zero": true, "data": true, "offset": 0}]

$ qemu-img map --output json empty-6g.raw
[{ "start": 0, "length": 6442450944, "depth": 0, "zero": true, "data": false, "offset": 0}]

Turns out that qemu-img map reports a hole based on BDRV_BLOCK_DATA, but
nbd server reports a hole based on BDRV_BLOCK_ALLOCATED.

The NBD protocol says:

  NBD_STATE_HOLE (bit 0): if set, the block represents a hole (and
  future writes to that area may cause fragmentation or encounter an
  NBD_ENOSPC error); if clear, the block is allocated or the server
  could not otherwise determine its status.

qemu-img manual says:

  whether the sectors contain actual data or not (boolean field data;
  if false, the sectors are either unallocated or stored as
  optimized all-zero clusters);

To me, data=false looks compatible with NBD_STATE_HOLE. From user point
of view, getting same results from qemu-nbd and qemu-img is more
important than being more correct about allocation status.

Changing nbd server to report holes using BDRV_BLOCK_DATA makes qemu-nbd
results compatible with qemu-img map:

$ qemu-img map --output json nbd://localhost
[{ "start": 0, "length": 6442450944, "depth": 0, "zero": true, "data": false, "offset": 0}]

Signed-off-by: Nir Soffer <nsoffer@redhat.com>
---
 nbd/server.c               | 4 ++--
 tests/qemu-iotests/241.out | 4 ++--
 2 files changed, 4 insertions(+), 4 deletions(-)