diff mbox series

[10/17] block: Add new BDRV_ZERO_OPEN flag

Message ID 20200131174436.2961874-11-eblake@redhat.com (mailing list archive)
State New, archived
Headers show
Series Improve qcow2 all-zero detection | expand

Commit Message

Eric Blake Jan. 31, 2020, 5:44 p.m. UTC
Knowing that a file reads as all zeroes when created is useful, but
limited in scope to drivers that can create images.  However, there
are also situations where pre-existing images can quickly be
determined to read as all zeroes, even when the image was not just
created by the same process.  The optimization used in qemu-img
convert to avoid a pre-zeroing pass on the destination is just as
useful in such a scenario.  As such, it is worth the block layer
adding another bit to bdrv_known_zeroes().

Note that while BDRV_ZERO_CREATE cannot chase through backing layers
(because it only applies at creation time, but the backing layer was
not created at the same time as the active layer being created), it IS
okay for BDRV_ZERO_OPEN to chase through layers (as long as all layers
currently read as zero, the image reads as zero).

Upcoming patches will update the qcow2, file-posix, and nbd drivers to
advertise the new bit when appropriate.

Signed-off-by: Eric Blake <eblake@redhat.com>
---
 block.c               | 12 ++++++------
 include/block/block.h | 10 ++++++++++
 qemu-img.c            | 10 ++++++----
 3 files changed, 22 insertions(+), 10 deletions(-)

Comments

Eric Blake Jan. 31, 2020, 6:03 p.m. UTC | #1
On 1/31/20 11:44 AM, Eric Blake wrote:
> Knowing that a file reads as all zeroes when created is useful, but
> limited in scope to drivers that can create images.  However, there
> are also situations where pre-existing images can quickly be
> determined to read as all zeroes, even when the image was not just
> created by the same process.  The optimization used in qemu-img
> convert to avoid a pre-zeroing pass on the destination is just as
> useful in such a scenario.  As such, it is worth the block layer
> adding another bit to bdrv_known_zeroes().
> 
> Note that while BDRV_ZERO_CREATE cannot chase through backing layers
> (because it only applies at creation time, but the backing layer was
> not created at the same time as the active layer being created), it IS
> okay for BDRV_ZERO_OPEN to chase through layers (as long as all layers
> currently read as zero, the image reads as zero).
> 
> Upcoming patches will update the qcow2, file-posix, and nbd drivers to
> advertise the new bit when appropriate.
> 
> Signed-off-by: Eric Blake <eblake@redhat.com>
> ---

[Is it bad when I review my own patches?]

> +++ b/block.c
> @@ -5078,7 +5078,7 @@ int bdrv_known_zeroes_truncate(BlockDriverState *bs)
> 
>   int bdrv_known_zeroes(BlockDriverState *bs)
>   {
> -    int mask = BDRV_ZERO_CREATE | BDRV_ZERO_TRUNCATE;
> +    int mask = BDRV_ZERO_CREATE | BDRV_ZERO_TRUNCATE | BDRV_ZERO_OPEN;
> 
>       if (!bs->drv) {
>           return 0;
> @@ -5100,17 +5100,17 @@ int bdrv_known_zeroes(BlockDriverState *bs)
>        * ZERO_CREATE is not viable.  If the current layer is smaller
>        * than the backing layer, truncation may expose backing data,
>        * restricting ZERO_TRUNCATE; treat failure to query size in the
> -     * same manner.  Otherwise, we can trust the driver.
> +     * same manner.  For ZERO_OPEN, we insist that both backing and
> +     * current layer report the bit.
>        */
> -
>       if (bs->backing) {

Spurious line deletion caused by rebasing.


> +++ b/include/block/block.h
> @@ -105,6 +105,16 @@ typedef enum {
>        * for drivers that set .bdrv_co_truncate.
>        */
>       BDRV_ZERO_TRUNCATE      = 0x2,
> +
> +    /*
> +     * bdrv_known_zeroes() should include this bit if an image is
> +     * known to read as all zeroes when first opened; this bit should
> +     * not be relied on after any writes to the image.  This can be
> +     * set even if BDRV_ZERO_INIT is clear, but should only be set if

Rebasing snafu - I renamed that bit BDRV_ZERO_CREATE in patch 9.
Max Reitz Feb. 4, 2020, 5:34 p.m. UTC | #2
On 31.01.20 18:44, Eric Blake wrote:
> Knowing that a file reads as all zeroes when created is useful, but
> limited in scope to drivers that can create images.  However, there
> are also situations where pre-existing images can quickly be
> determined to read as all zeroes, even when the image was not just
> created by the same process.  The optimization used in qemu-img
> convert to avoid a pre-zeroing pass on the destination is just as
> useful in such a scenario.  As such, it is worth the block layer
> adding another bit to bdrv_known_zeroes().
> 
> Note that while BDRV_ZERO_CREATE cannot chase through backing layers
> (because it only applies at creation time, but the backing layer was
> not created at the same time as the active layer being created), it IS
> okay for BDRV_ZERO_OPEN to chase through layers (as long as all layers
> currently read as zero, the image reads as zero).
> 
> Upcoming patches will update the qcow2, file-posix, and nbd drivers to
> advertise the new bit when appropriate.
> 
> Signed-off-by: Eric Blake <eblake@redhat.com>
> ---
>  block.c               | 12 ++++++------
>  include/block/block.h | 10 ++++++++++
>  qemu-img.c            | 10 ++++++----
>  3 files changed, 22 insertions(+), 10 deletions(-)
> 
> diff --git a/block.c b/block.c
> index fac0813140aa..d68f527dc41f 100644
> --- a/block.c
> +++ b/block.c
> @@ -5078,7 +5078,7 @@ int bdrv_known_zeroes_truncate(BlockDriverState *bs)
> 
>  int bdrv_known_zeroes(BlockDriverState *bs)
>  {
> -    int mask = BDRV_ZERO_CREATE | BDRV_ZERO_TRUNCATE;
> +    int mask = BDRV_ZERO_CREATE | BDRV_ZERO_TRUNCATE | BDRV_ZERO_OPEN;
> 
>      if (!bs->drv) {
>          return 0;
> @@ -5100,17 +5100,17 @@ int bdrv_known_zeroes(BlockDriverState *bs)
>       * ZERO_CREATE is not viable.  If the current layer is smaller
>       * than the backing layer, truncation may expose backing data,
>       * restricting ZERO_TRUNCATE; treat failure to query size in the
> -     * same manner.  Otherwise, we can trust the driver.
> +     * same manner.  For ZERO_OPEN, we insist that both backing and
> +     * current layer report the bit.
>       */
> -
>      if (bs->backing) {
>          int64_t back = bdrv_getlength(bs->backing->bs);
>          int64_t curr = bdrv_getlength(bs);
> 
> -        if (back < 0 || curr < back) {
> -            return 0;
> +        mask = bdrv_known_zeroes(bs->backing->bs) & BDRV_ZERO_OPEN;
> +        if (back >= 0 && curr >= back) {
> +            mask |= BDRV_ZERO_TRUNCATE;
>          }
> -        mask = BDRV_ZERO_TRUNCATE;
>      }
> 
>      if (bs->drv->bdrv_known_zeroes) {
> diff --git a/include/block/block.h b/include/block/block.h
> index a6a227f50678..dafb8cc2bd80 100644
> --- a/include/block/block.h
> +++ b/include/block/block.h
> @@ -105,6 +105,16 @@ typedef enum {
>       * for drivers that set .bdrv_co_truncate.
>       */
>      BDRV_ZERO_TRUNCATE      = 0x2,
> +
> +    /*
> +     * bdrv_known_zeroes() should include this bit if an image is
> +     * known to read as all zeroes when first opened; this bit should
> +     * not be relied on after any writes to the image.

Is there a good reason for this?  Because to me this screams like we are
going to check this flag without ensuring that the image has actually
not been written to yet.  So if it’s generally easy for drivers to stop
reporting this flag after a write, then maybe we should do so.

Max

>                                                          This can be
> +     * set even if BDRV_ZERO_INIT is clear, but should only be set if
> +     * making the determination is more efficient than looping over
> +     * block status for the image.
> +     */
> +    BDRV_ZERO_OPEN          = 0x4,
>  } BdrvZeroFlags;
> 
>  typedef struct BlockSizes {
> diff --git a/qemu-img.c b/qemu-img.c
> index e60217e6c382..c8519a74f738 100644
> --- a/qemu-img.c
> +++ b/qemu-img.c
> @@ -1985,10 +1985,12 @@ static int convert_do_copy(ImgConvertState *s)
>      int64_t sector_num = 0;
> 
>      /* Check whether we have zero initialisation or can get it efficiently */
> -    if (!s->has_zero_init && s->target_is_new && s->min_sparse &&
> -        !s->target_has_backing) {
> -        s->has_zero_init = !!(bdrv_known_zeroes(blk_bs(s->target)) &
> -                              BDRV_ZERO_CREATE);
> +    if (!s->has_zero_init && s->min_sparse && !s->target_has_backing) {
> +        ret = bdrv_known_zeroes(blk_bs(s->target));
> +        if (ret & BDRV_ZERO_OPEN ||
> +            (s->target_is_new && ret & BDRV_ZERO_CREATE)) {
> +            s->has_zero_init = true;
> +        }
>      }
> 
>      if (!s->has_zero_init && !s->target_has_backing &&
>
Eric Blake Feb. 4, 2020, 5:50 p.m. UTC | #3
On 2/4/20 11:34 AM, Max Reitz wrote:

>> +++ b/include/block/block.h
>> @@ -105,6 +105,16 @@ typedef enum {
>>        * for drivers that set .bdrv_co_truncate.
>>        */
>>       BDRV_ZERO_TRUNCATE      = 0x2,
>> +
>> +    /*
>> +     * bdrv_known_zeroes() should include this bit if an image is
>> +     * known to read as all zeroes when first opened; this bit should
>> +     * not be relied on after any writes to the image.
> 
> Is there a good reason for this?  Because to me this screams like we are
> going to check this flag without ensuring that the image has actually
> not been written to yet.  So if it’s generally easy for drivers to stop
> reporting this flag after a write, then maybe we should do so.

In patch 15 (implementing things in qcow2), I actually wrote the driver 
to return live results, rather than just open-time results, in part 
because writing the bit to persistent storage in qcow2 means that the 
bit must be accurate, without relying on the block layer's help.

But my pending NBD patch (not posted yet, but will be soon), the 
proposal I'm making for the NBD protocol itself is just open-time, not 
live, and so it would be more work than necessary to make the NBD driver 
report live results.

But it seems like it should be easy enough to also patch the block layer 
itself to guarantee that callers of bdrv_known_zeroes() cannot see this 
bit set if the block layer has been used in any non-zero transaction, by 
repeating the same logic as used in qcow2 to kill the bit (any 
write/write_compressed/bdrv_copy clear the bit, any trim clears the bit 
if the driver does not guarantee trim reads as zero, any truncate clears 
the bit if the driver does not guarantee truncate reads as zero, etc). 
Basically, the block layer would cache the results of .bdrv_known_zeroes 
during .bdrv_co_open, bdrv_co_pwrite() and friends would update that 
cache, and and bdrv_known_zeroes() would report the cached value rather 
than a fresh call to .bdrv_known_zeroes.

Are we worried enough about clients of this interface to make the block 
layer more robust?  (From the maintenance standpoint, the more the block 
layer guarantees, the easier it is to write code that uses the block 
layer; but there is the counter-argument that making the block layer 
track whether an image has been modified means a [slight] penalty to 
every write request to update the boolean.)
Vladimir Sementsov-Ogievskiy Feb. 5, 2020, 8:39 a.m. UTC | #4
04.02.2020 20:50, Eric Blake wrote:
> On 2/4/20 11:34 AM, Max Reitz wrote:
> 
>>> +++ b/include/block/block.h
>>> @@ -105,6 +105,16 @@ typedef enum {
>>>        * for drivers that set .bdrv_co_truncate.
>>>        */
>>>       BDRV_ZERO_TRUNCATE      = 0x2,
>>> +
>>> +    /*
>>> +     * bdrv_known_zeroes() should include this bit if an image is
>>> +     * known to read as all zeroes when first opened; this bit should
>>> +     * not be relied on after any writes to the image.
>>
>> Is there a good reason for this?  Because to me this screams like we are
>> going to check this flag without ensuring that the image has actually
>> not been written to yet.  So if it’s generally easy for drivers to stop
>> reporting this flag after a write, then maybe we should do so.
> 
> In patch 15 (implementing things in qcow2), I actually wrote the driver to return live results, rather than just open-time results, in part because writing the bit to persistent storage in qcow2 means that the bit must be accurate, without relying on the block layer's help.
> 
> But my pending NBD patch (not posted yet, but will be soon), the proposal I'm making for the NBD protocol itself is just open-time, not live, and so it would be more work than necessary to make the NBD driver report live results.
> 
> But it seems like it should be easy enough to also patch the block layer itself to guarantee that callers of bdrv_known_zeroes() cannot see this bit set if the block layer has been used in any non-zero transaction, by repeating the same logic as used in qcow2 to kill the bit (any write/write_compressed/bdrv_copy clear the bit, any trim clears the bit if the driver does not guarantee trim reads as zero, any truncate clears the bit if the driver does not guarantee truncate reads as zero, etc). Basically, the block layer would cache the results of .bdrv_known_zeroes during .bdrv_co_open, bdrv_co_pwrite() and friends would update that cache, and and bdrv_known_zeroes() would report the cached value rather than a fresh call to .bdrv_known_zeroes.
> 
> Are we worried enough about clients of this interface to make the block layer more robust?  (From the maintenance standpoint, the more the block layer guarantees, the easier it is to write code that uses the block layer; but there is the counter-argument that making the block layer track whether an image has been modified means a [slight] penalty to every write request to update the boolean.)
> 

I'm for functions is_all_zero(), vs is_it_was_all_zeros_when_opened(). I never liked places in code where is_zero_init() used like is_disk_zero(), without any checks, that the drive was not modified, or even created by use.
Max Reitz Feb. 5, 2020, 5:26 p.m. UTC | #5
On 04.02.20 18:50, Eric Blake wrote:
> On 2/4/20 11:34 AM, Max Reitz wrote:
> 
>>> +++ b/include/block/block.h
>>> @@ -105,6 +105,16 @@ typedef enum {
>>>        * for drivers that set .bdrv_co_truncate.
>>>        */
>>>       BDRV_ZERO_TRUNCATE      = 0x2,
>>> +
>>> +    /*
>>> +     * bdrv_known_zeroes() should include this bit if an image is
>>> +     * known to read as all zeroes when first opened; this bit should
>>> +     * not be relied on after any writes to the image.
>>
>> Is there a good reason for this?  Because to me this screams like we are
>> going to check this flag without ensuring that the image has actually
>> not been written to yet.  So if it’s generally easy for drivers to stop
>> reporting this flag after a write, then maybe we should do so.
> 
> In patch 15 (implementing things in qcow2), I actually wrote the driver
> to return live results, rather than just open-time results, in part
> because writing the bit to persistent storage in qcow2 means that the
> bit must be accurate, without relying on the block layer's help.
> 
> But my pending NBD patch (not posted yet, but will be soon), the
> proposal I'm making for the NBD protocol itself is just open-time, not
> live, and so it would be more work than necessary to make the NBD driver
> report live results.
> 
> But it seems like it should be easy enough to also patch the block layer
> itself to guarantee that callers of bdrv_known_zeroes() cannot see this
> bit set if the block layer has been used in any non-zero transaction, by
> repeating the same logic as used in qcow2 to kill the bit (any
> write/write_compressed/bdrv_copy clear the bit, any trim clears the bit
> if the driver does not guarantee trim reads as zero, any truncate clears
> the bit if the driver does not guarantee truncate reads as zero, etc).
> Basically, the block layer would cache the results of .bdrv_known_zeroes
> during .bdrv_co_open, bdrv_co_pwrite() and friends would update that
> cache, and and bdrv_known_zeroes() would report the cached value rather
> than a fresh call to .bdrv_known_zeroes.

Sounds reasonable to me in generaly, but I’d prefer it to be fetched
on-demand rather than unconditionally in bdrv_open().

(I realize that this means a tri-state of “known false”, “known true”,
and “not yet queried”.)

> Are we worried enough about clients of this interface to make the block
> layer more robust?  (From the maintenance standpoint, the more the block
> layer guarantees, the easier it is to write code that uses the block
> layer; but there is the counter-argument that making the block layer
> track whether an image has been modified means a [slight] penalty to
> every write request to update the boolean.)

Just like Vladimir, I’m worried about repeating the same mistakes we
have before: That is, most places that called bdrv_has_zero_init() just
did so out of wishful thinking, hoping that it would do what they need
it to.  It didn’t.

Max
diff mbox series

Patch

diff --git a/block.c b/block.c
index fac0813140aa..d68f527dc41f 100644
--- a/block.c
+++ b/block.c
@@ -5078,7 +5078,7 @@  int bdrv_known_zeroes_truncate(BlockDriverState *bs)

 int bdrv_known_zeroes(BlockDriverState *bs)
 {
-    int mask = BDRV_ZERO_CREATE | BDRV_ZERO_TRUNCATE;
+    int mask = BDRV_ZERO_CREATE | BDRV_ZERO_TRUNCATE | BDRV_ZERO_OPEN;

     if (!bs->drv) {
         return 0;
@@ -5100,17 +5100,17 @@  int bdrv_known_zeroes(BlockDriverState *bs)
      * ZERO_CREATE is not viable.  If the current layer is smaller
      * than the backing layer, truncation may expose backing data,
      * restricting ZERO_TRUNCATE; treat failure to query size in the
-     * same manner.  Otherwise, we can trust the driver.
+     * same manner.  For ZERO_OPEN, we insist that both backing and
+     * current layer report the bit.
      */
-
     if (bs->backing) {
         int64_t back = bdrv_getlength(bs->backing->bs);
         int64_t curr = bdrv_getlength(bs);

-        if (back < 0 || curr < back) {
-            return 0;
+        mask = bdrv_known_zeroes(bs->backing->bs) & BDRV_ZERO_OPEN;
+        if (back >= 0 && curr >= back) {
+            mask |= BDRV_ZERO_TRUNCATE;
         }
-        mask = BDRV_ZERO_TRUNCATE;
     }

     if (bs->drv->bdrv_known_zeroes) {
diff --git a/include/block/block.h b/include/block/block.h
index a6a227f50678..dafb8cc2bd80 100644
--- a/include/block/block.h
+++ b/include/block/block.h
@@ -105,6 +105,16 @@  typedef enum {
      * for drivers that set .bdrv_co_truncate.
      */
     BDRV_ZERO_TRUNCATE      = 0x2,
+
+    /*
+     * bdrv_known_zeroes() should include this bit if an image is
+     * known to read as all zeroes when first opened; this bit should
+     * not be relied on after any writes to the image.  This can be
+     * set even if BDRV_ZERO_INIT is clear, but should only be set if
+     * making the determination is more efficient than looping over
+     * block status for the image.
+     */
+    BDRV_ZERO_OPEN          = 0x4,
 } BdrvZeroFlags;

 typedef struct BlockSizes {
diff --git a/qemu-img.c b/qemu-img.c
index e60217e6c382..c8519a74f738 100644
--- a/qemu-img.c
+++ b/qemu-img.c
@@ -1985,10 +1985,12 @@  static int convert_do_copy(ImgConvertState *s)
     int64_t sector_num = 0;

     /* Check whether we have zero initialisation or can get it efficiently */
-    if (!s->has_zero_init && s->target_is_new && s->min_sparse &&
-        !s->target_has_backing) {
-        s->has_zero_init = !!(bdrv_known_zeroes(blk_bs(s->target)) &
-                              BDRV_ZERO_CREATE);
+    if (!s->has_zero_init && s->min_sparse && !s->target_has_backing) {
+        ret = bdrv_known_zeroes(blk_bs(s->target));
+        if (ret & BDRV_ZERO_OPEN ||
+            (s->target_is_new && ret & BDRV_ZERO_CREATE)) {
+            s->has_zero_init = true;
+        }
     }

     if (!s->has_zero_init && !s->target_has_backing &&