diff mbox series

[09/17] block: Refactor bdrv_has_zero_init{,_truncate}

Message ID 20200131174436.2961874-10-eblake@redhat.com (mailing list archive)
State New, archived
Headers show
Series Improve qcow2 all-zero detection | expand

Commit Message

Eric Blake Jan. 31, 2020, 5:44 p.m. UTC
Having two slightly-different function names for related purposes is
unwieldy, especially since I envision adding yet another notion of
zero support in an upcoming patch.  It doesn't help that
bdrv_has_zero_init() is a misleading name (I originally thought that a
driver could only return 1 when opening an already-existing image
known to be all zeroes; but in reality many drivers always return 1
because it only applies to a just-created image).  Refactor all uses
to instead have a single function that returns multiple bits of
information, with better naming and documentation.

No semantic change, although some of the changes (such as to qcow2.c)
require a careful reading to see how it remains the same.

Signed-off-by: Eric Blake <eblake@redhat.com>
---
 block.c                    | 49 ++++++++++++++------------------------
 block/file-posix.c         |  3 +--
 block/file-win32.c         |  3 +--
 block/nfs.c                |  7 +++---
 block/parallels.c          |  4 ++--
 block/qcow.c               |  2 +-
 block/qcow2.c              | 10 ++++----
 block/qed.c                |  3 +--
 block/raw-format.c         | 12 +++-------
 block/rbd.c                |  3 +--
 block/sheepdog.c           |  9 +++----
 block/ssh.c                |  7 +++---
 block/vdi.c                |  8 +++----
 block/vhdx.c               | 16 ++++++-------
 block/vmdk.c               |  9 +++----
 block/vpc.c                |  8 +++----
 blockdev.c                 |  2 +-
 include/block/block.h      | 28 +++++++++++++++++++---
 include/block/block_int.h  | 15 ++----------
 qemu-img.c                 |  3 ++-
 tests/qemu-iotests/122     |  2 +-
 tests/qemu-iotests/188     |  2 +-
 tests/qemu-iotests/188.out |  2 +-
 23 files changed, 96 insertions(+), 111 deletions(-)

Comments

Vladimir Sementsov-Ogievskiy Feb. 4, 2020, 3:35 p.m. UTC | #1
31.01.2020 20:44, Eric Blake wrote:
> Having two slightly-different function names for related purposes is
> unwieldy, especially since I envision adding yet another notion of
> zero support in an upcoming patch.  It doesn't help that
> bdrv_has_zero_init() is a misleading name (I originally thought that a
> driver could only return 1 when opening an already-existing image
> known to be all zeroes; but in reality many drivers always return 1
> because it only applies to a just-created image).  Refactor all uses
> to instead have a single function that returns multiple bits of
> information, with better naming and documentation.

Sounds good

> 
> No semantic change, although some of the changes (such as to qcow2.c)
> require a careful reading to see how it remains the same.
> 

...

> diff --git a/include/block/block.h b/include/block/block.h
> index 6cd566324d95..a6a227f50678 100644
> --- a/include/block/block.h
> +++ b/include/block/block.h

Hmm, header file in the middle of the patch, possibly you don't use
[diff]
     orderFile = scripts/git.orderfile

in git config.. Or it is broken.

> @@ -85,6 +85,28 @@ typedef enum {
>       BDRV_REQ_MASK               = 0x3ff,
>   } BdrvRequestFlags;
> 
> +typedef enum {
> +    /*
> +     * bdrv_known_zeroes() should include this bit if the contents of
> +     * a freshly-created image with no backing file reads as all
> +     * zeroes without any additional effort.  If .bdrv_co_truncate is
> +     * set, then this must be clear if BDRV_ZERO_TRUNCATE is clear.

I understand that this is preexisting logic, but could I ask: why? What's wrong
if driver can guarantee that created file is all-zero, but is not sure about
file resizing? I agree that it's normal for these flags to have the same value,
but what is the reason for this restriction?..

So, the only possible combination of flags, when they differs, is create=0 and
truncate=1.. How is it possible?

> +     * Since this bit is only reliable at image creation, a driver may
> +     * return this bit even for existing images that do not currently
> +     * read as zero.
> +     */
> +    BDRV_ZERO_CREATE        = 0x1,
> +
> +    /*
> +     * bdrv_known_zeroes() should include this bit if growing an image
> +     * with PREALLOC_MODE_OFF (either with no backing file, or beyond
> +     * the size of the backing file) will read the new data as all
> +     * zeroes without any additional effort.  This bit only matters
> +     * for drivers that set .bdrv_co_truncate.
> +     */
> +    BDRV_ZERO_TRUNCATE      = 0x2,
> +} BdrvZeroFlags;
> +

...
Eric Blake Feb. 4, 2020, 3:49 p.m. UTC | #2
On 2/4/20 9:35 AM, Vladimir Sementsov-Ogievskiy wrote:
> 31.01.2020 20:44, Eric Blake wrote:
>> Having two slightly-different function names for related purposes is
>> unwieldy, especially since I envision adding yet another notion of
>> zero support in an upcoming patch.  It doesn't help that
>> bdrv_has_zero_init() is a misleading name (I originally thought that a
>> driver could only return 1 when opening an already-existing image
>> known to be all zeroes; but in reality many drivers always return 1
>> because it only applies to a just-created image).  Refactor all uses
>> to instead have a single function that returns multiple bits of
>> information, with better naming and documentation.
> 
> Sounds good
> 
>>
>> No semantic change, although some of the changes (such as to qcow2.c)
>> require a careful reading to see how it remains the same.
>>
> 
> ...
> 
>> diff --git a/include/block/block.h b/include/block/block.h
>> index 6cd566324d95..a6a227f50678 100644
>> --- a/include/block/block.h
>> +++ b/include/block/block.h
> 
> Hmm, header file in the middle of the patch, possibly you don't use
> [diff]
>      orderFile = scripts/git.orderfile
> 
> in git config.. Or it is broken.

I do have it set up, so I'm not sure why it didn't work as planned. 
I'll make sure v2 follows the order I intended.

> 
>> @@ -85,6 +85,28 @@ typedef enum {
>>       BDRV_REQ_MASK               = 0x3ff,
>>   } BdrvRequestFlags;
>>
>> +typedef enum {
>> +    /*
>> +     * bdrv_known_zeroes() should include this bit if the contents of
>> +     * a freshly-created image with no backing file reads as all
>> +     * zeroes without any additional effort.  If .bdrv_co_truncate is
>> +     * set, then this must be clear if BDRV_ZERO_TRUNCATE is clear.
> 
> I understand that this is preexisting logic, but could I ask: why? 
> What's wrong
> if driver can guarantee that created file is all-zero, but is not sure 
> about
> file resizing? I agree that it's normal for these flags to have the same 
> value,
> but what is the reason for this restriction?..

For _this_ patch, my goal is to preserve pre-existing practice. Where we 
think pre-existing practice is wrong, we can then improve it in other 
patches (see patch 6, for example).

I _think_ the reason for this original limitation is as follows: If an 
image can be resized, we could choose to perform 'create(size=0), 
truncate(size=final)' instead of 'create(size=final)', and we want to 
guarantee the same behavior. If truncation can't guarantee a zero read, 
then why is creation doing so?

But as I did not write the original patch, I would welcome Max's input 
with regards to the thought behind commit ceaca56f.

> 
> So, the only possible combination of flags, when they differs, is 
> create=0 and
> truncate=1.. How is it possible?

qcow2 had that mode, at least before patch 5.

> 
>> +     * Since this bit is only reliable at image creation, a driver may
>> +     * return this bit even for existing images that do not currently
>> +     * read as zero.
>> +     */
>> +    BDRV_ZERO_CREATE        = 0x1,
>> +
>> +    /*
>> +     * bdrv_known_zeroes() should include this bit if growing an image
>> +     * with PREALLOC_MODE_OFF (either with no backing file, or beyond
>> +     * the size of the backing file) will read the new data as all
>> +     * zeroes without any additional effort.  This bit only matters
>> +     * for drivers that set .bdrv_co_truncate.
>> +     */
>> +    BDRV_ZERO_TRUNCATE      = 0x2,
>> +} BdrvZeroFlags;
>> +
> 
> ...
> 
>
Vladimir Sementsov-Ogievskiy Feb. 4, 2020, 4:07 p.m. UTC | #3
04.02.2020 18:49, Eric Blake wrote:
> On 2/4/20 9:35 AM, Vladimir Sementsov-Ogievskiy wrote:
>> 31.01.2020 20:44, Eric Blake wrote:
>>> Having two slightly-different function names for related purposes is
>>> unwieldy, especially since I envision adding yet another notion of
>>> zero support in an upcoming patch.  It doesn't help that
>>> bdrv_has_zero_init() is a misleading name (I originally thought that a
>>> driver could only return 1 when opening an already-existing image
>>> known to be all zeroes; but in reality many drivers always return 1
>>> because it only applies to a just-created image).  Refactor all uses
>>> to instead have a single function that returns multiple bits of
>>> information, with better naming and documentation.
>>
>> Sounds good
>>
>>>
>>> No semantic change, although some of the changes (such as to qcow2.c)
>>> require a careful reading to see how it remains the same.
>>>
>>
>> ...
>>
>>> diff --git a/include/block/block.h b/include/block/block.h
>>> index 6cd566324d95..a6a227f50678 100644
>>> --- a/include/block/block.h
>>> +++ b/include/block/block.h
>>
>> Hmm, header file in the middle of the patch, possibly you don't use
>> [diff]
>>      orderFile = scripts/git.orderfile
>>
>> in git config.. Or it is broken.
> 
> I do have it set up, so I'm not sure why it didn't work as planned. I'll make sure v2 follows the order I intended.
> 
>>
>>> @@ -85,6 +85,28 @@ typedef enum {
>>>       BDRV_REQ_MASK               = 0x3ff,
>>>   } BdrvRequestFlags;
>>>
>>> +typedef enum {
>>> +    /*
>>> +     * bdrv_known_zeroes() should include this bit if the contents of
>>> +     * a freshly-created image with no backing file reads as all
>>> +     * zeroes without any additional effort.  If .bdrv_co_truncate is
>>> +     * set, then this must be clear if BDRV_ZERO_TRUNCATE is clear.
>>
>> I understand that this is preexisting logic, but could I ask: why? What's wrong
>> if driver can guarantee that created file is all-zero, but is not sure about
>> file resizing? I agree that it's normal for these flags to have the same value,
>> but what is the reason for this restriction?..
> 
> For _this_ patch, my goal is to preserve pre-existing practice. Where we think pre-existing practice is wrong, we can then improve it in other patches (see patch 6, for example).

This is OK, of course, I'm just trying to understand existing logic.

> 
> I _think_ the reason for this original limitation is as follows: If an image can be resized, we could choose to perform 'create(size=0), truncate(size=final)' instead of 'create(size=final)', and we want to guarantee the same behavior. If truncation can't guarantee a zero read, then why is creation doing so?

If we want to guarantee the same behavior, we should restrict any difference between these flags :)

> 
> But as I did not write the original patch, I would welcome Max's input with regards to the thought behind commit ceaca56f.
> 
>>
>> So, the only possible combination of flags, when they differs, is create=0 and
>> truncate=1.. How is it possible?
> 
> qcow2 had that mode, at least before patch 5.

yes, it reported even for encrypted images truncate=1...

> 
>>
>>> +     * Since this bit is only reliable at image creation, a driver may
>>> +     * return this bit even for existing images that do not currently
>>> +     * read as zero.
>>> +     */
>>> +    BDRV_ZERO_CREATE        = 0x1,
>>> +
>>> +    /*
>>> +     * bdrv_known_zeroes() should include this bit if growing an image
>>> +     * with PREALLOC_MODE_OFF (either with no backing file, or beyond
>>> +     * the size of the backing file) will read the new data as all
>>> +     * zeroes without any additional effort.  This bit only matters
>>> +     * for drivers that set .bdrv_co_truncate.
>>> +     */
>>> +    BDRV_ZERO_TRUNCATE      = 0x2,
>>> +} BdrvZeroFlags;
>>> +
>>
>> ...
>>
>>
>
Max Reitz Feb. 4, 2020, 5:42 p.m. UTC | #4
On 04.02.20 16:35, Vladimir Sementsov-Ogievskiy wrote:
> 31.01.2020 20:44, Eric Blake wrote:
>> Having two slightly-different function names for related purposes is
>> unwieldy, especially since I envision adding yet another notion of
>> zero support in an upcoming patch.  It doesn't help that
>> bdrv_has_zero_init() is a misleading name (I originally thought that a
>> driver could only return 1 when opening an already-existing image
>> known to be all zeroes; but in reality many drivers always return 1
>> because it only applies to a just-created image).  Refactor all uses
>> to instead have a single function that returns multiple bits of
>> information, with better naming and documentation.
> 
> Sounds good
> 
>>
>> No semantic change, although some of the changes (such as to qcow2.c)
>> require a careful reading to see how it remains the same.
>>
> 
> ...
> 
>> diff --git a/include/block/block.h b/include/block/block.h
>> index 6cd566324d95..a6a227f50678 100644
>> --- a/include/block/block.h
>> +++ b/include/block/block.h
> 
> Hmm, header file in the middle of the patch, possibly you don't use
> [diff]
>     orderFile = scripts/git.orderfile
> 
> in git config.. Or it is broken.
> 
>> @@ -85,6 +85,28 @@ typedef enum {
>>       BDRV_REQ_MASK               = 0x3ff,
>>   } BdrvRequestFlags;
>>
>> +typedef enum {
>> +    /*
>> +     * bdrv_known_zeroes() should include this bit if the contents of
>> +     * a freshly-created image with no backing file reads as all
>> +     * zeroes without any additional effort.  If .bdrv_co_truncate is
>> +     * set, then this must be clear if BDRV_ZERO_TRUNCATE is clear.
> 
> I understand that this is preexisting logic, but could I ask: why?
> What's wrong
> if driver can guarantee that created file is all-zero, but is not sure
> about
> file resizing? I agree that it's normal for these flags to have the same
> value,
> but what is the reason for this restriction?..

If areas added by truncation (or growth, rather) are always zero, then
the file can always be created with size 0 and grown from there.  Thus,
images where truncation adds zeroed areas will generally always be zero
after creation.

> So, the only possible combination of flags, when they differs, is
> create=0 and
> truncate=1.. How is it possible?

For preallocated qcow2 images, it depends on the storage whether they
are actually 0 after creation.  Hence qcow2_has_zero_init() then defers
to bdrv_has_zero_init() of s->data_file->bs.

But when you truncate them (with PREALLOC_MODE_OFF, as
BlockDriver.bdrv_has_zero_init_truncate()’s comment explains), the new
area is always going to be 0, regardless of initial preallocation.


I just noticed a bug there, though: Encrypted qcow2 images will not see
areas added through growth as 0.  Hence, qcow2’s
bdrv_has_zero_init_truncate() implementation should not return true
unconditionally, but only for unencrypted images.

Max
Eric Blake Feb. 4, 2020, 5:51 p.m. UTC | #5
On 2/4/20 11:42 AM, Max Reitz wrote:

>>
>> I understand that this is preexisting logic, but could I ask: why?
>> What's wrong
>> if driver can guarantee that created file is all-zero, but is not sure
>> about
>> file resizing? I agree that it's normal for these flags to have the same
>> value,
>> but what is the reason for this restriction?..
> 
> If areas added by truncation (or growth, rather) are always zero, then
> the file can always be created with size 0 and grown from there.  Thus,
> images where truncation adds zeroed areas will generally always be zero
> after creation.
> 
>> So, the only possible combination of flags, when they differs, is
>> create=0 and
>> truncate=1.. How is it possible?
> 
> For preallocated qcow2 images, it depends on the storage whether they
> are actually 0 after creation.  Hence qcow2_has_zero_init() then defers
> to bdrv_has_zero_init() of s->data_file->bs.
> 
> But when you truncate them (with PREALLOC_MODE_OFF, as
> BlockDriver.bdrv_has_zero_init_truncate()’s comment explains), the new
> area is always going to be 0, regardless of initial preallocation.
> 
> 
> I just noticed a bug there, though: Encrypted qcow2 images will not see
> areas added through growth as 0.  Hence, qcow2’s
> bdrv_has_zero_init_truncate() implementation should not return true
> unconditionally, but only for unencrypted images.

Hence patch 5 earlier in the series :)
Max Reitz Feb. 4, 2020, 5:53 p.m. UTC | #6
On 31.01.20 18:44, Eric Blake wrote:
> Having two slightly-different function names for related purposes is
> unwieldy, especially since I envision adding yet another notion of
> zero support in an upcoming patch.  It doesn't help that
> bdrv_has_zero_init() is a misleading name (I originally thought that a
> driver could only return 1 when opening an already-existing image
> known to be all zeroes; but in reality many drivers always return 1
> because it only applies to a just-created image).

I don’t find it misleading, I just find it meaningless, which then makes
it open to interpretation (or maybe rather s/interpretation/wishful
thinking/).

> Refactor all uses
> to instead have a single function that returns multiple bits of
> information, with better naming and documentation.

It doesn’t make sense to me.  How exactly is it unwieldy?  In the sense
that we have to deal with multiple rather small implementation functions
rather than a big one per driver?  Actually, multiple small functions
sounds better to me – unless the three implementations share common code.

As for the callers, they only want a single flag out of the three, don’t
they?  If so, it doesn’t really matter for them.

In fact, I can imagine that drivers can trivially return
BDRV_ZERO_TRUNCATE information (because the preallocation mode is
fixed), whereas BDRV_ZERO_CREATE can be a bit more involved, and
BDRV_ZERO_OPEN could take even more time because some (constant-time)
inquiries have to be done.

And thus callers which just want the trivially obtainable
BDRV_ZERO_TRUNCATE info have to wait for the BDRV_ZERO_OPEN inquiry,
even though they don’t care about that flag.

So I’d leave it as separate functions so drivers can feel free to have
implementations for BDRV_ZERO_OPEN that take more than mere microseconds
but that are more accurate.

(Or maybe if you really want it to be a single functions, callers could
pass the mask of flags they care about.  If all flags are trivially
obtainable, the implementations would then simply create their result
mask and & it with the caller-given mask.  For implementations where
some branches could take a bit more time, those branches are only taken
when the caller cares about the given flag.  But again, I don’t
necessarily think having a single function is more easily handleable
than three smaller ones.)

Max
Eric Blake Feb. 4, 2020, 7:03 p.m. UTC | #7
On 2/4/20 11:53 AM, Max Reitz wrote:
> On 31.01.20 18:44, Eric Blake wrote:
>> Having two slightly-different function names for related purposes is
>> unwieldy, especially since I envision adding yet another notion of
>> zero support in an upcoming patch.  It doesn't help that
>> bdrv_has_zero_init() is a misleading name (I originally thought that a
>> driver could only return 1 when opening an already-existing image
>> known to be all zeroes; but in reality many drivers always return 1
>> because it only applies to a just-created image).
> 
> I don’t find it misleading, I just find it meaningless, which then makes
> it open to interpretation (or maybe rather s/interpretation/wishful
> thinking/).
> 
>> Refactor all uses
>> to instead have a single function that returns multiple bits of
>> information, with better naming and documentation.
> 
> It doesn’t make sense to me.  How exactly is it unwieldy?  In the sense
> that we have to deal with multiple rather small implementation functions
> rather than a big one per driver?  Actually, multiple small functions
> sounds better to me – unless the three implementations share common code.

Common code for dealing with encryption, backing files, and so on.  It 
felt like I had a lot of code repetition when keeping functions separate.

> 
> As for the callers, they only want a single flag out of the three, don’t
> they?  If so, it doesn’t really matter for them.

The qemu-img.c caller in patch 10 checks ZERO_CREATE | ZERO_OPEN, so we 
DO have situations of checking more than one bit, vs. needing two 
function calls.

> 
> In fact, I can imagine that drivers can trivially return
> BDRV_ZERO_TRUNCATE information (because the preallocation mode is
> fixed), whereas BDRV_ZERO_CREATE can be a bit more involved, and
> BDRV_ZERO_OPEN could take even more time because some (constant-time)
> inquiries have to be done.

In looking at the rest of the series, drivers were either completely 
trivial (in which case, declaring:

.bdrv_has_zero_init = bdrv_has_zero_init_1,
.bdrv_has_zero_init_truncate = bdrv_has_zero_init_1,

was a lot wordier than the new:

.bdrv_known_zeroes = bdrv_known_zeroes_truncate,

), or completely spelled out but where both creation and truncation were 
determined in the same amount of effort.


> 
> And thus callers which just want the trivially obtainable
> BDRV_ZERO_TRUNCATE info have to wait for the BDRV_ZERO_OPEN inquiry,
> even though they don’t care about that flag.

True, but only to a minor extent; and the documentation mentions that 
the BDRV_ZERO_OPEN calculation MUST NOT be as expensive as a blind 
block_status loop.  Meanwhile, callers tend to only care about 
bdrv_known_zeroes() right after opening an image or right before 
resizing (not repeatedly during runtime); and you also argued elsewhere 
in this thread that it may be worth having the block layer cache 
BDRV_ZERO_OPEN and update the cache on any write, at which point, the 
expense in the driver callback really is a one-time call during 
bdrv_co_open().  And in that case, whether the one-time expense is done 
via a single function call or via three driver callbacks, the amount of 
work is the same; but the driver callback interface is easier if there 
is only one callback (similar to how bdrv_unallocated_blocks_are_zero() 
calls bdrv_get_info() only for bdi.unallocated_blocks_are_zero, even 
though BlockDriverInfo tracks much more than that boolean).

In fact, it may be worth consolidating known zeroes support into 
BlockDriverInfo.

> 
> So I’d leave it as separate functions so drivers can feel free to have
> implementations for BDRV_ZERO_OPEN that take more than mere microseconds
> but that are more accurate.
> 
> (Or maybe if you really want it to be a single functions, callers could
> pass the mask of flags they care about.  If all flags are trivially
> obtainable, the implementations would then simply create their result
> mask and & it with the caller-given mask.  For implementations where
> some branches could take a bit more time, those branches are only taken
> when the caller cares about the given flag.  But again, I don’t
> necessarily think having a single function is more easily handleable
> than three smaller ones.)

Those are still viable options, but before I repaint the bikeshed along 
those lines, I'd at least like a review of whether the overall idea of 
having a notion of 'reads-all-zeroes' is indeed useful enough, 
regardless of how we implement it as one vs. three driver callbacks.
Vladimir Sementsov-Ogievskiy Feb. 5, 2020, 7:51 a.m. UTC | #8
04.02.2020 20:42, Max Reitz wrote:
> On 04.02.20 16:35, Vladimir Sementsov-Ogievskiy wrote:
>> 31.01.2020 20:44, Eric Blake wrote:
>>> Having two slightly-different function names for related purposes is
>>> unwieldy, especially since I envision adding yet another notion of
>>> zero support in an upcoming patch.  It doesn't help that
>>> bdrv_has_zero_init() is a misleading name (I originally thought that a
>>> driver could only return 1 when opening an already-existing image
>>> known to be all zeroes; but in reality many drivers always return 1
>>> because it only applies to a just-created image).  Refactor all uses
>>> to instead have a single function that returns multiple bits of
>>> information, with better naming and documentation.
>>
>> Sounds good
>>
>>>
>>> No semantic change, although some of the changes (such as to qcow2.c)
>>> require a careful reading to see how it remains the same.
>>>
>>
>> ...
>>
>>> diff --git a/include/block/block.h b/include/block/block.h
>>> index 6cd566324d95..a6a227f50678 100644
>>> --- a/include/block/block.h
>>> +++ b/include/block/block.h
>>
>> Hmm, header file in the middle of the patch, possibly you don't use
>> [diff]
>>      orderFile = scripts/git.orderfile
>>
>> in git config.. Or it is broken.
>>
>>> @@ -85,6 +85,28 @@ typedef enum {
>>>        BDRV_REQ_MASK               = 0x3ff,
>>>    } BdrvRequestFlags;
>>>
>>> +typedef enum {
>>> +    /*
>>> +     * bdrv_known_zeroes() should include this bit if the contents of
>>> +     * a freshly-created image with no backing file reads as all
>>> +     * zeroes without any additional effort.  If .bdrv_co_truncate is
>>> +     * set, then this must be clear if BDRV_ZERO_TRUNCATE is clear.
>>
>> I understand that this is preexisting logic, but could I ask: why?
>> What's wrong
>> if driver can guarantee that created file is all-zero, but is not sure
>> about
>> file resizing? I agree that it's normal for these flags to have the same
>> value,
>> but what is the reason for this restriction?..
> 
> If areas added by truncation (or growth, rather) are always zero, then
> the file can always be created with size 0 and grown from there.  Thus,
> images where truncation adds zeroed areas will generally always be zero
> after creation.

This means, that if truncation bit is set, than create bit should be set.. But
here we say that if truncation is clear, than create bit must be clear.

> 
>> So, the only possible combination of flags, when they differs, is
>> create=0 and
>> truncate=1.. How is it possible?
> 
> For preallocated qcow2 images, it depends on the storage whether they
> are actually 0 after creation.  Hence qcow2_has_zero_init() then defers
> to bdrv_has_zero_init() of s->data_file->bs.
> 
> But when you truncate them (with PREALLOC_MODE_OFF, as
> BlockDriver.bdrv_has_zero_init_truncate()’s comment explains), the new
> area is always going to be 0, regardless of initial preallocation.

ah yes, due to qcow2 zero clusters.

> 
> 
> I just noticed a bug there, though: Encrypted qcow2 images will not see
> areas added through growth as 0.  Hence, qcow2’s
> bdrv_has_zero_init_truncate() implementation should not return true
> unconditionally, but only for unencrypted images.
> 
> Max
>
Eric Blake Feb. 5, 2020, 2:07 p.m. UTC | #9
On 2/5/20 1:51 AM, Vladimir Sementsov-Ogievskiy wrote:

>>>> +typedef enum {
>>>> +    /*
>>>> +     * bdrv_known_zeroes() should include this bit if the contents of
>>>> +     * a freshly-created image with no backing file reads as all
>>>> +     * zeroes without any additional effort.  If .bdrv_co_truncate is
>>>> +     * set, then this must be clear if BDRV_ZERO_TRUNCATE is clear.
>>>
>>> I understand that this is preexisting logic, but could I ask: why?
>>> What's wrong
>>> if driver can guarantee that created file is all-zero, but is not sure
>>> about
>>> file resizing? I agree that it's normal for these flags to have the same
>>> value,
>>> but what is the reason for this restriction?..
>>
>> If areas added by truncation (or growth, rather) are always zero, then
>> the file can always be created with size 0 and grown from there.  Thus,
>> images where truncation adds zeroed areas will generally always be zero
>> after creation.
> 
> This means, that if truncation bit is set, than create bit should be 
> set.. But
> here we say that if truncation is clear, than create bit must be clear.

Max, did we get the logic backwards?

> 
>>
>>> So, the only possible combination of flags, when they differs, is
>>> create=0 and
>>> truncate=1.. How is it possible?
>>
>> For preallocated qcow2 images, it depends on the storage whether they
>> are actually 0 after creation.  Hence qcow2_has_zero_init() then defers
>> to bdrv_has_zero_init() of s->data_file->bs.
>>
>> But when you truncate them (with PREALLOC_MODE_OFF, as
>> BlockDriver.bdrv_has_zero_init_truncate()’s comment explains), the new
>> area is always going to be 0, regardless of initial preallocation.
> 
> ah yes, due to qcow2 zero clusters.

Hmm. Do we actually set the zero flag on unallocated clusters when 
resizing a qcow2 image?  That would be an O(n) operation (we have to 
visit the L2 entry for each added cluster, even if only to set the zero 
cluster bit).  Or do we instead just rely on the fact that qcow2 is 
inherently sparse, and that when you resize the guest-visible size 
without writing any new clusters, then it is only subsequent guest 
access to those addresses that finally allocate clusters, making resize 
O(1) (update the qcow2 metadata cluster, but not any L2 tables) while 
still reading 0 from the new data.  To some extent, that's what the 
allocation mode is supposed to control.

What about with external data images, where a resize in guest-visible 
length requires a resize of the underlying data image?  There, we DO 
have to worry about whether the data image resizes with zeroes (as in 
the filesystem) or with random data (as in a block device).
Vladimir Sementsov-Ogievskiy Feb. 5, 2020, 2:25 p.m. UTC | #10
05.02.2020 17:07, Eric Blake wrote:
> On 2/5/20 1:51 AM, Vladimir Sementsov-Ogievskiy wrote:
> 
>>>>> +typedef enum {
>>>>> +    /*
>>>>> +     * bdrv_known_zeroes() should include this bit if the contents of
>>>>> +     * a freshly-created image with no backing file reads as all
>>>>> +     * zeroes without any additional effort.  If .bdrv_co_truncate is
>>>>> +     * set, then this must be clear if BDRV_ZERO_TRUNCATE is clear.
>>>>
>>>> I understand that this is preexisting logic, but could I ask: why?
>>>> What's wrong
>>>> if driver can guarantee that created file is all-zero, but is not sure
>>>> about
>>>> file resizing? I agree that it's normal for these flags to have the same
>>>> value,
>>>> but what is the reason for this restriction?..
>>>
>>> If areas added by truncation (or growth, rather) are always zero, then
>>> the file can always be created with size 0 and grown from there.  Thus,
>>> images where truncation adds zeroed areas will generally always be zero
>>> after creation.
>>
>> This means, that if truncation bit is set, than create bit should be set.. But
>> here we say that if truncation is clear, than create bit must be clear.
> 
> Max, did we get the logic backwards?
> 
>>
>>>
>>>> So, the only possible combination of flags, when they differs, is
>>>> create=0 and
>>>> truncate=1.. How is it possible?
>>>
>>> For preallocated qcow2 images, it depends on the storage whether they
>>> are actually 0 after creation.  Hence qcow2_has_zero_init() then defers
>>> to bdrv_has_zero_init() of s->data_file->bs.
>>>
>>> But when you truncate them (with PREALLOC_MODE_OFF, as
>>> BlockDriver.bdrv_has_zero_init_truncate()’s comment explains), the new
>>> area is always going to be 0, regardless of initial preallocation.
>>
>> ah yes, due to qcow2 zero clusters.
> 
> Hmm. Do we actually set the zero flag on unallocated clusters when resizing a qcow2 image?  That would be an O(n) operation (we have to visit the L2 entry for each added cluster, even if only to set the zero cluster bit).  Or do we instead just rely on the fact that qcow2 is inherently sparse, and that when you resize the guest-visible size without writing any new clusters, then it is only subsequent guest access to those addresses that finally allocate clusters, making resize O(1) (update the qcow2 metadata cluster, but not any L2 tables) while still reading 0 from the new data.  To some extent, that's what the allocation mode is supposed to control.

We must mark as ZERO new cluster at least if there is a _larger_ backing file, to prevent data from backing file become available for the guest. But we don't do it. It's a bug and there is fixing series from Kevin, I've just pinged it:
"[PATCH for-4.2? v3 0/8] block: Fix resize (extending) of short overlays"

> 
> What about with external data images, where a resize in guest-visible length requires a resize of the underlying data image?  There, we DO have to worry about whether the data image resizes with zeroes (as in the filesystem) or with random data (as in a block device).
>
Eric Blake Feb. 5, 2020, 2:36 p.m. UTC | #11
On 2/5/20 8:25 AM, Vladimir Sementsov-Ogievskiy wrote:

>>>> But when you truncate them (with PREALLOC_MODE_OFF, as
>>>> BlockDriver.bdrv_has_zero_init_truncate()’s comment explains), the new
>>>> area is always going to be 0, regardless of initial preallocation.
>>>
>>> ah yes, due to qcow2 zero clusters.
>>
>> Hmm. Do we actually set the zero flag on unallocated clusters when 
>> resizing a qcow2 image?  That would be an O(n) operation (we have to 
>> visit the L2 entry for each added cluster, even if only to set the 
>> zero cluster bit).  Or do we instead just rely on the fact that qcow2 
>> is inherently sparse, and that when you resize the guest-visible size 
>> without writing any new clusters, then it is only subsequent guest 
>> access to those addresses that finally allocate clusters, making 
>> resize O(1) (update the qcow2 metadata cluster, but not any L2 tables) 
>> while still reading 0 from the new data.  To some extent, that's what 
>> the allocation mode is supposed to control.
> 
> We must mark as ZERO new cluster at least if there is a _larger_ backing 
> file, to prevent data from backing file become available for the guest. 
> But we don't do it. It's a bug and there is fixing series from Kevin, 
> I've just pinged it:
> "[PATCH for-4.2? v3 0/8] block: Fix resize (extending) of short overlays"

There's a difference for a backing file larger than the qcow2 file, and 
the protocol layer larger than the qcow2 file.  Visually, with the 
following four nodes:

f1 [qcow2 format] <- f2 [qcow2 format]
   v                        v
p1 [file protocol]    p2 [file protocol]

If f1 is larger than f2, then resizing f2 without writing zero clusters 
leaks the data from f1 into f2.  The block layer knows this: prior to 
this series, bdrv_has_zero_init_truncate() returns 0 if bs->backing is 
present; and even in this series, see patch 6/17 which continues to 
force a 0 return rather than calling into the driver if the sizes are 
suspect.  But that is an uncommon corner case; in short, the qcow2 
callback .bdrv_has_zero_init_truncate is NOT reachable in that scenario, 
whether before or after this series.

On the other hand, if p2 is larger than f2, resizing f2 reads zeroes. 
That's because qcow2 HAS to add L2 mappings into p2 before data from p2 
can leak, but .bdrv_co_truncate(PREALLOC_MODE_OFF) does not add any L2 
mappings.  Thus, qcow2 blindly returning 1 for 
.bdrv_has_zero_init_truncate was correct (other than the anomaly of 
bs->encrypted, also fixed earlier in this series).
Max Reitz Feb. 5, 2020, 4:43 p.m. UTC | #12
On 04.02.20 18:51, Eric Blake wrote:
> On 2/4/20 11:42 AM, Max Reitz wrote:
> 
>>>
>>> I understand that this is preexisting logic, but could I ask: why?
>>> What's wrong
>>> if driver can guarantee that created file is all-zero, but is not sure
>>> about
>>> file resizing? I agree that it's normal for these flags to have the same
>>> value,
>>> but what is the reason for this restriction?..
>>
>> If areas added by truncation (or growth, rather) are always zero, then
>> the file can always be created with size 0 and grown from there.  Thus,
>> images where truncation adds zeroed areas will generally always be zero
>> after creation.
>>
>>> So, the only possible combination of flags, when they differs, is
>>> create=0 and
>>> truncate=1.. How is it possible?
>>
>> For preallocated qcow2 images, it depends on the storage whether they
>> are actually 0 after creation.  Hence qcow2_has_zero_init() then defers
>> to bdrv_has_zero_init() of s->data_file->bs.
>>
>> But when you truncate them (with PREALLOC_MODE_OFF, as
>> BlockDriver.bdrv_has_zero_init_truncate()’s comment explains), the new
>> area is always going to be 0, regardless of initial preallocation.
>>
>>
>> I just noticed a bug there, though: Encrypted qcow2 images will not see
>> areas added through growth as 0.  Hence, qcow2’s
>> bdrv_has_zero_init_truncate() implementation should not return true
>> unconditionally, but only for unencrypted images.
> 
> Hence patch 5 earlier in the series :)

Ah, good. :-)

Max
Max Reitz Feb. 5, 2020, 5:22 p.m. UTC | #13
On 04.02.20 20:03, Eric Blake wrote:
> On 2/4/20 11:53 AM, Max Reitz wrote:
>> On 31.01.20 18:44, Eric Blake wrote:
>>> Having two slightly-different function names for related purposes is
>>> unwieldy, especially since I envision adding yet another notion of
>>> zero support in an upcoming patch.  It doesn't help that
>>> bdrv_has_zero_init() is a misleading name (I originally thought that a
>>> driver could only return 1 when opening an already-existing image
>>> known to be all zeroes; but in reality many drivers always return 1
>>> because it only applies to a just-created image).
>>
>> I don’t find it misleading, I just find it meaningless, which then makes
>> it open to interpretation (or maybe rather s/interpretation/wishful
>> thinking/).
>>
>>> Refactor all uses
>>> to instead have a single function that returns multiple bits of
>>> information, with better naming and documentation.
>>
>> It doesn’t make sense to me.  How exactly is it unwieldy?  In the sense
>> that we have to deal with multiple rather small implementation functions
>> rather than a big one per driver?  Actually, multiple small functions
>> sounds better to me – unless the three implementations share common code.
> 
> Common code for dealing with encryption, backing files, and so on.  It
> felt like I had a lot of code repetition when keeping functions separate.

Well, I suppose “dealing with” means “if (encrypted || has_backing)”, so
duplicating that doesn’t seem too bad.

>> As for the callers, they only want a single flag out of the three, don’t
>> they?  If so, it doesn’t really matter for them.
> 
> The qemu-img.c caller in patch 10 checks ZERO_CREATE | ZERO_OPEN, so we
> DO have situations of checking more than one bit, vs. needing two
> function calls.

Hm, OK.  Not sure if that place would look worse with two function
calls, but, well.

>> In fact, I can imagine that drivers can trivially return
>> BDRV_ZERO_TRUNCATE information (because the preallocation mode is
>> fixed), whereas BDRV_ZERO_CREATE can be a bit more involved, and
>> BDRV_ZERO_OPEN could take even more time because some (constant-time)
>> inquiries have to be done.
> 
> In looking at the rest of the series, drivers were either completely
> trivial (in which case, declaring:
> 
> .bdrv_has_zero_init = bdrv_has_zero_init_1,
> .bdrv_has_zero_init_truncate = bdrv_has_zero_init_1,
> 
> was a lot wordier than the new:
> 
> .bdrv_known_zeroes = bdrv_known_zeroes_truncate,

Not sure if that’s bad, though.

> ), or completely spelled out but where both creation and truncation were
> determined in the same amount of effort.

Well, usually, the effort is minimal, but OK.

>> And thus callers which just want the trivially obtainable
>> BDRV_ZERO_TRUNCATE info have to wait for the BDRV_ZERO_OPEN inquiry,
>> even though they don’t care about that flag.
> 
> True, but only to a minor extent; and the documentation mentions that
> the BDRV_ZERO_OPEN calculation MUST NOT be as expensive as a blind
> block_status loop.

So it must be less expensive than an arbitrarily complex loop.  I think
a single SEEK_DATA/HOLE call was something like O(n) on tmpfs?

What I’m trying to say is that this is not a good limit and can mean
anything.

I do think this limit definition makes sense for callers that want to
know about ZERO_OPEN.  But I don’t know why we would have to let other
callers wait, too.

> Meanwhile, callers tend to only care about
> bdrv_known_zeroes() right after opening an image or right before
> resizing (not repeatedly during runtime);

Hm, yes.  I was thinking of parallels, but that only checks once in
parallels_open(), so it’s OK.

> and you also argued elsewhere
> in this thread that it may be worth having the block layer cache
> BDRV_ZERO_OPEN and update the cache on any write,

I didn’t say the block layer, but it if makes sense.

> at which point, the
> expense in the driver callback really is a one-time call during
> bdrv_co_open().

It definitely doesn’t make sense to me to do that call unconditionally
in bdrv_co_open().

> And in that case, whether the one-time expense is done
> via a single function call or via three driver callbacks, the amount of
> work is the same; but the driver callback interface is easier if there
> is only one callback (similar to how bdrv_unallocated_blocks_are_zero()
> calls bdrv_get_info() only for bdi.unallocated_blocks_are_zero, even
> though BlockDriverInfo tracks much more than that boolean).
> 
> In fact, it may be worth consolidating known zeroes support into
> BlockDriverInfo.

I’m very skeptical of that.  BDI already has the problem that it doesn’t
know which of the information the caller actually wants and that it is
sometimes used in a quasi-hot path.

Maybe that means it is indeed time to incorporate it into BDI, but the
caller should have a way of specifying what parts of BDI it actually
needs and then drivers can skip anything that isn’t trivially obtainable
that the caller doesn’t need.

>> So I’d leave it as separate functions so drivers can feel free to have
>> implementations for BDRV_ZERO_OPEN that take more than mere microseconds
>> but that are more accurate.
>>
>> (Or maybe if you really want it to be a single functions, callers could
>> pass the mask of flags they care about.  If all flags are trivially
>> obtainable, the implementations would then simply create their result
>> mask and & it with the caller-given mask.  For implementations where
>> some branches could take a bit more time, those branches are only taken
>> when the caller cares about the given flag.  But again, I don’t
>> necessarily think having a single function is more easily handleable
>> than three smaller ones.)
> 
> Those are still viable options, but before I repaint the bikeshed along
> those lines, I'd at least like a review of whether the overall idea of
> having a notion of 'reads-all-zeroes' is indeed useful enough,
> regardless of how we implement it as one vs. three driver callbacks.

I’m as hesitant as ever to give a review that this notion is useful,
because I haven’t seen a practical example yet where the problem isn’t
the fact that NBD doesn’t have 64-bit write_zeroes support.

So far, it looks to me like this notion is only really useful for cases
where we expect a management layer on top of qemu anyway.  And then I’m
not sure that this new feature works reliably enough for such a
management layer.

(I’m not saying it isn’t useful.  Again, intuitively it does seem
useful.  Intuition can be enough to merge a sufficiently simple series
that doesn’t increase code complexity too much.  But I’m still asking
for actual practical examples, because that would make a better
argument, of course.)

Max
Max Reitz Feb. 5, 2020, 5:55 p.m. UTC | #14
On 05.02.20 15:07, Eric Blake wrote:
> On 2/5/20 1:51 AM, Vladimir Sementsov-Ogievskiy wrote:
> 
>>>>> +typedef enum {
>>>>> +    /*
>>>>> +     * bdrv_known_zeroes() should include this bit if the contents of
>>>>> +     * a freshly-created image with no backing file reads as all
>>>>> +     * zeroes without any additional effort.  If .bdrv_co_truncate is
>>>>> +     * set, then this must be clear if BDRV_ZERO_TRUNCATE is clear.
>>>>
>>>> I understand that this is preexisting logic, but could I ask: why?
>>>> What's wrong
>>>> if driver can guarantee that created file is all-zero, but is not sure
>>>> about
>>>> file resizing? I agree that it's normal for these flags to have the
>>>> same
>>>> value,
>>>> but what is the reason for this restriction?..
>>>
>>> If areas added by truncation (or growth, rather) are always zero, then
>>> the file can always be created with size 0 and grown from there.  Thus,
>>> images where truncation adds zeroed areas will generally always be zero
>>> after creation.
>>
>> This means, that if truncation bit is set, than create bit should be
>> set.. But
>> here we say that if truncation is clear, than create bit must be clear.
> 
> Max, did we get the logic backwards?

Or maybe my explanation was just wrong.

Because nobody actually forces a driver to use truncate to ensure that
an newly created file will be 0.  Hm.  And more importantly, you can’t
use truncate with PREALLOC_MODE_OFF when you want to create an image
with preallocation.

Let’s see.  The offending commit message says:

> No .bdrv_has_zero_init() implementation returns 1 if growing the file
> would add non-zero areas (at least with PREALLOC_MODE_OFF), so using it
> in lieu of this new function was always safe.
> 
> But on the other hand, it is possible that growing an image that is not
> zero-initialized would still add a zero-initialized area, like when
> using nonpreallocating truncation on a preallocated image.  For callers
> that care only about truncation, not about creation with potential
> preallocation, this new function is useful.

So I suppose the explanation is just the preallocation mode alone;
has_zero_init() is for the image’s actual preallocation mode, whereas
has_zero_init_truncate() is forced to PREALLOC_MODE_OFF.  As such, the
latter is less strict than the former.  So the former cannot be true
when the latter is false.

>>>> So, the only possible combination of flags, when they differs, is
>>>> create=0 and
>>>> truncate=1.. How is it possible?
>>>
>>> For preallocated qcow2 images, it depends on the storage whether they
>>> are actually 0 after creation.  Hence qcow2_has_zero_init() then defers
>>> to bdrv_has_zero_init() of s->data_file->bs.
>>>
>>> But when you truncate them (with PREALLOC_MODE_OFF, as
>>> BlockDriver.bdrv_has_zero_init_truncate()’s comment explains), the new
>>> area is always going to be 0, regardless of initial preallocation.
>>
>> ah yes, due to qcow2 zero clusters.
> 
> Hmm. Do we actually set the zero flag on unallocated clusters when
> resizing a qcow2 image?

No.  They are just unallocated, i.e. zero.  (Nodes with backing files
never return true for bdrv_has_zero_init_truncate anyway).

> That would be an O(n) operation (we have to
> visit the L2 entry for each added cluster, even if only to set the zero
> cluster bit).  Or do we instead just rely on the fact that qcow2 is
> inherently sparse, and that when you resize the guest-visible size
> without writing any new clusters, then it is only subsequent guest
> access to those addresses that finally allocate clusters, making resize
> O(1) (update the qcow2 metadata cluster, but not any L2 tables) while
> still reading 0 from the new data.  To some extent, that's what the
> allocation mode is supposed to control.
> 
> What about with external data images, where a resize in guest-visible
> length requires a resize of the underlying data image?  There, we DO
> have to worry about whether the data image resizes with zeroes (as in
> the filesystem) or with random data (as in a block device).

Well, partially: Namely, only with data_file_raw.  Because otherwise the
clusters are still unallocated and thus read as zero.  So yes, then we
do have to worry about that.

With data_file_raw, we have an obligation to make the data file return
the same data as the qcow2 file, so, um.  I wonder whether we actually
take any care of this yet.  If you have some external data file without
zero_init(_truncate), do get zeroes when reading from the qcow2 node,
but non-zeroes when reading from the raw data file?  That would be OK
without data_file_raw, but not with it.  I suppose I’ll have to test it.

Max
Eric Blake Feb. 5, 2020, 6:39 p.m. UTC | #15
On 2/5/20 11:22 AM, Max Reitz wrote:

> 
>>> And thus callers which just want the trivially obtainable
>>> BDRV_ZERO_TRUNCATE info have to wait for the BDRV_ZERO_OPEN inquiry,
>>> even though they don’t care about that flag.
>>
>> True, but only to a minor extent; and the documentation mentions that
>> the BDRV_ZERO_OPEN calculation MUST NOT be as expensive as a blind
>> block_status loop.
> 
> So it must be less expensive than an arbitrarily complex loop.  I think
> a single SEEK_DATA/HOLE call was something like O(n) on tmpfs?

If I recall, the tmpfs bug was that it was O(n) where n was based on the 
initial offset and the number of extents prior to that offset.  The 
probe at offset 0 is O(1) (because there are no prior extents), whether 
it reaches the end of the file (the entire image is a hole) or hits data 
beforehand.  It is only probes at later offsets where the speed penalty 
sets in, and where an O(n) loop over all extents turned into an O(n^2) 
traversal time due to the O(n) nature of each later lookup.

> 
> What I’m trying to say is that this is not a good limit and can mean
> anything.
> 
> I do think this limit definition makes sense for callers that want to
> know about ZERO_OPEN.  But I don’t know why we would have to let other
> callers wait, too.

Keeping separate functions may still be the right approach for v2, 
although I'd still like to name things better ('has_zero_init' vs. 
'has_zero_init_truncate' did not work well for me).  And if I'm renaming 
things, then I'm touching just as much code whether I rename and keep 
separate functions or rename and consolidate into one.

> 
>> Meanwhile, callers tend to only care about
>> bdrv_known_zeroes() right after opening an image or right before
>> resizing (not repeatedly during runtime);
> 
> Hm, yes.  I was thinking of parallels, but that only checks once in
> parallels_open(), so it’s OK.
> 
>> and you also argued elsewhere
>> in this thread that it may be worth having the block layer cache
>> BDRV_ZERO_OPEN and update the cache on any write,
> 
> I didn’t say the block layer, but it if makes sense.
> 
>> at which point, the
>> expense in the driver callback really is a one-time call during
>> bdrv_co_open().
> 
> It definitely doesn’t make sense to me to do that call unconditionally
> in bdrv_co_open().

Okay, you have a point there - while 'qemu-img convert' cares, not all 
clients of bdrv_co_open() are worried about whether the existing 
contents are zero; so unconditionally priming a cache during 
bdrv_co_open is not as wise as doing things when it will actually be 
useful information.  On the other hand, if it is something that clients 
only use when first opening an image, caching data doesn't make much 
sense either.

So, we know that bdrv_has_zero_init() is only viable on a just-created 
image, bdrv_has_zero_init_truncate() is only viable if you are about to 
resize an image using bdrv_co_truncate(PREALLOC_MODE_OFF).

Hmm - thinking aloud: our ultimate goal is that we want to make it 
easier for algorithms that can be sped up IF the image is currently 
known to be all zero.  Maybe what this means is that we really want to 
be tweaking bdrv_make_zeroes() to do all the work, something along the 
lines of:
- if the image is known to already be all zeroes using an O(1) probe 
(this includes if the image was freshly created and creation sees all 
zeroes, or if a block_status at offset 0 shows a hole for the entire 
image, or if an NBD extension advertises all zero at connection 
time...), return success
- if the image has a FAST truncate, and resizing reads zeroes, we can 
truncate to size 0 and back to the desired size, then return success; 
determining if truncate is fast should be similar to how 
BDRV_REQ_NO_FALLBACK determines whether write zeroes is fast
- if the image supports BDRV_REQ_NO_FALLBACK with write zeroes, we can 
request a write zeroes over the whole image, which will either succeed 
(the image is now quickly zero) or fail (writing zeroes as we go is the 
best we can do)
- if the image could report that it is all zeroes, but only at the cost 
of O(n) work such as a loop over block_status (or even O(n^2) with the 
tmpfs lseek bug), it's easier to report failure than to worry about 
making the image read all zeroes

qemu-img would then only ever need to consult --target-is-zero and 
bdrv_make_zero(), and not worry about any other function calls; while 
the block layer would take care of coordinating whatever other call 
sequences make the most sense in reporting success or failure in getting 
the image into an all-zero state if it was not already there.


> 
>> And in that case, whether the one-time expense is done
>> via a single function call or via three driver callbacks, the amount of
>> work is the same; but the driver callback interface is easier if there
>> is only one callback (similar to how bdrv_unallocated_blocks_are_zero()
>> calls bdrv_get_info() only for bdi.unallocated_blocks_are_zero, even
>> though BlockDriverInfo tracks much more than that boolean).
>>
>> In fact, it may be worth consolidating known zeroes support into
>> BlockDriverInfo.
> 
> I’m very skeptical of that.  BDI already has the problem that it doesn’t
> know which of the information the caller actually wants and that it is
> sometimes used in a quasi-hot path.
> 
> Maybe that means it is indeed time to incorporate it into BDI, but the
> caller should have a way of specifying what parts of BDI it actually
> needs and then drivers can skip anything that isn’t trivially obtainable
> that the caller doesn’t need.

I'm reminded of the recent kernel addition of xstat(); the traditional 
stat/fstat interfaces really don't know which bits of information you 
care about, so you get everything, but with xstat(), you can request 
only what you plan to use, which may indeed result in speedups.


>> Those are still viable options, but before I repaint the bikeshed along
>> those lines, I'd at least like a review of whether the overall idea of
>> having a notion of 'reads-all-zeroes' is indeed useful enough,
>> regardless of how we implement it as one vs. three driver callbacks.
> 
> I’m as hesitant as ever to give a review that this notion is useful,
> because I haven’t seen a practical example yet where the problem isn’t
> the fact that NBD doesn’t have 64-bit write_zeroes support.

Even if the NBD protocol gains 64-bit write_zeroes, not all NBD servers 
will be compliant with that extension.  This will speed up operations 
when talking to older servers which do not support 64-bit writes, even 
if newer qemu is never such a server.

> 
> So far, it looks to me like this notion is only really useful for cases
> where we expect a management layer on top of qemu anyway.  And then I’m
> not sure that this new feature works reliably enough for such a
> management layer.
> 
> (I’m not saying it isn’t useful.  Again, intuitively it does seem
> useful.  Intuition can be enough to merge a sufficiently simple series
> that doesn’t increase code complexity too much.  But I’m still asking
> for actual practical examples, because that would make a better
> argument, of course.)

I'm hoping when I post my NBD patches that I can also provide some 
benchmark timing numbers to make the case a bit more concrete.
Max Reitz Feb. 6, 2020, 9:18 a.m. UTC | #16
On 05.02.20 19:39, Eric Blake wrote:
> On 2/5/20 11:22 AM, Max Reitz wrote:
> 
>>
>>>> And thus callers which just want the trivially obtainable
>>>> BDRV_ZERO_TRUNCATE info have to wait for the BDRV_ZERO_OPEN inquiry,
>>>> even though they don’t care about that flag.
>>>
>>> True, but only to a minor extent; and the documentation mentions that
>>> the BDRV_ZERO_OPEN calculation MUST NOT be as expensive as a blind
>>> block_status loop.
>>
>> So it must be less expensive than an arbitrarily complex loop.  I think
>> a single SEEK_DATA/HOLE call was something like O(n) on tmpfs?
> 
> If I recall, the tmpfs bug was that it was O(n) where n was based on the
> initial offset and the number of extents prior to that offset.  The
> probe at offset 0 is O(1) (because there are no prior extents), whether
> it reaches the end of the file (the entire image is a hole) or hits data
> beforehand.  It is only probes at later offsets where the speed penalty
> sets in, and where an O(n) loop over all extents turned into an O(n^2)
> traversal time due to the O(n) nature of each later lookup.

So it’s O(n/2) for each lookup on average, which is O(n). O:-)

>> What I’m trying to say is that this is not a good limit and can mean
>> anything.
>>
>> I do think this limit definition makes sense for callers that want to
>> know about ZERO_OPEN.  But I don’t know why we would have to let other
>> callers wait, too.
> 
> Keeping separate functions may still be the right approach for v2,
> although I'd still like to name things better ('has_zero_init' vs.
> 'has_zero_init_truncate' did not work well for me).  And if I'm renaming
> things, then I'm touching just as much code whether I rename and keep
> separate functions or rename and consolidate into one.

I definitely don’t disagree about renaming, and if you think that
consolidating the functions is worth it, then it probably makes sense
(because you have the experience there, given this series).

But I’d still like to throw in that a rename is a more easily doable and
reviewable change than a consolidation, even if you get the same number
of hunks in the end.

>>> Meanwhile, callers tend to only care about
>>> bdrv_known_zeroes() right after opening an image or right before
>>> resizing (not repeatedly during runtime);
>>
>> Hm, yes.  I was thinking of parallels, but that only checks once in
>> parallels_open(), so it’s OK.
>>
>>> and you also argued elsewhere
>>> in this thread that it may be worth having the block layer cache
>>> BDRV_ZERO_OPEN and update the cache on any write,
>>
>> I didn’t say the block layer, but it if makes sense.
>>
>>> at which point, the
>>> expense in the driver callback really is a one-time call during
>>> bdrv_co_open().
>>
>> It definitely doesn’t make sense to me to do that call unconditionally
>> in bdrv_co_open().
> 
> Okay, you have a point there - while 'qemu-img convert' cares, not all
> clients of bdrv_co_open() are worried about whether the existing
> contents are zero; so unconditionally priming a cache during
> bdrv_co_open is not as wise as doing things when it will actually be
> useful information.  On the other hand, if it is something that clients
> only use when first opening an image, caching data doesn't make much
> sense either.
> 
> So, we know that bdrv_has_zero_init() is only viable on a just-created
> image, bdrv_has_zero_init_truncate() is only viable if you are about to
> resize an image using bdrv_co_truncate(PREALLOC_MODE_OFF).
> 
> Hmm - thinking aloud: our ultimate goal is that we want to make it
> easier for algorithms that can be sped up IF the image is currently
> known to be all zero.  Maybe what this means is that we really want to
> be tweaking bdrv_make_zeroes() to do all the work, something along the
> lines of:
> - if the image is known to already be all zeroes using an O(1) probe
> (this includes if the image was freshly created and creation sees all
> zeroes, or if a block_status at offset 0 shows a hole for the entire
> image, or if an NBD extension advertises all zero at connection
> time...), return success

[Insert case here: If the image has a custom make_zeroes implementation,
use it, and return success]

> - if the image has a FAST truncate, and resizing reads zeroes, we can
> truncate to size 0 and back to the desired size, then return success;
> determining if truncate is fast should be similar to how
> BDRV_REQ_NO_FALLBACK determines whether write zeroes is fast
> - if the image supports BDRV_REQ_NO_FALLBACK with write zeroes, we can
> request a write zeroes over the whole image, which will either succeed
> (the image is now quickly zero) or fail (writing zeroes as we go is the
> best we can do)
> - if the image could report that it is all zeroes, but only at the cost
> of O(n) work such as a loop over block_status (or even O(n^2) with the
> tmpfs lseek bug), it's easier to report failure than to worry about
> making the image read all zeroes
> 
> qemu-img would then only ever need to consult --target-is-zero and
> bdrv_make_zero(), and not worry about any other function calls; while
> the block layer would take care of coordinating whatever other call
> sequences make the most sense in reporting success or failure in getting
> the image into an all-zero state if it was not already there.

(As I just wrote on the cover letter thread:) Sounds good to me.

>>> And in that case, whether the one-time expense is done
>>> via a single function call or via three driver callbacks, the amount of
>>> work is the same; but the driver callback interface is easier if there
>>> is only one callback (similar to how bdrv_unallocated_blocks_are_zero()
>>> calls bdrv_get_info() only for bdi.unallocated_blocks_are_zero, even
>>> though BlockDriverInfo tracks much more than that boolean).
>>>
>>> In fact, it may be worth consolidating known zeroes support into
>>> BlockDriverInfo.
>>
>> I’m very skeptical of that.  BDI already has the problem that it doesn’t
>> know which of the information the caller actually wants and that it is
>> sometimes used in a quasi-hot path.
>>
>> Maybe that means it is indeed time to incorporate it into BDI, but the
>> caller should have a way of specifying what parts of BDI it actually
>> needs and then drivers can skip anything that isn’t trivially obtainable
>> that the caller doesn’t need.
> 
> I'm reminded of the recent kernel addition of xstat(); the traditional
> stat/fstat interfaces really don't know which bits of information you
> care about, so you get everything, but with xstat(), you can request
> only what you plan to use, which may indeed result in speedups.

I hope we can put off thinking about it if the known-zeroes check can
simply be put into make_zero(). O:-)

>>> Those are still viable options, but before I repaint the bikeshed along
>>> those lines, I'd at least like a review of whether the overall idea of
>>> having a notion of 'reads-all-zeroes' is indeed useful enough,
>>> regardless of how we implement it as one vs. three driver callbacks.
>>
>> I’m as hesitant as ever to give a review that this notion is useful,
>> because I haven’t seen a practical example yet where the problem isn’t
>> the fact that NBD doesn’t have 64-bit write_zeroes support.
> 
> Even if the NBD protocol gains 64-bit write_zeroes, not all NBD servers
> will be compliant with that extension.  This will speed up operations
> when talking to older servers which do not support 64-bit writes, even
> if newer qemu is never such a server.

The same applies to reads-all-zeroes, though.  Only if both server and
client provide/understand this flag will it do something.

Max
diff mbox series

Patch

diff --git a/block.c b/block.c
index d132662f3103..fac0813140aa 100644
--- a/block.c
+++ b/block.c
@@ -5066,38 +5066,20 @@  int bdrv_get_flags(BlockDriverState *bs)
     return bs->open_flags;
 }

-int bdrv_has_zero_init_1(BlockDriverState *bs)
+int bdrv_known_zeroes_create(BlockDriverState *bs)
 {
-    return 1;
+    return BDRV_ZERO_CREATE;
 }

-int bdrv_has_zero_init(BlockDriverState *bs)
+int bdrv_known_zeroes_truncate(BlockDriverState *bs)
 {
-    if (!bs->drv) {
-        return 0;
-    }
-
-    /*
-     * If BS is a copy on write image, it is initialized to the
-     * contents of the base image, which may not be zeroes.  Likewise,
-     * encrypted images do not read as zero.
-     */
-    if (bs->backing || bs->encrypted) {
-        return 0;
-    }
-    if (bs->drv->bdrv_has_zero_init) {
-        return bs->drv->bdrv_has_zero_init(bs);
-    }
-    if (bs->file && bs->drv->is_filter) {
-        return bdrv_has_zero_init(bs->file->bs);
-    }
-
-    /* safe default */
-    return 0;
+    return BDRV_ZERO_CREATE | BDRV_ZERO_TRUNCATE;
 }

-int bdrv_has_zero_init_truncate(BlockDriverState *bs)
+int bdrv_known_zeroes(BlockDriverState *bs)
 {
+    int mask = BDRV_ZERO_CREATE | BDRV_ZERO_TRUNCATE;
+
     if (!bs->drv) {
         return 0;
     }
@@ -5113,9 +5095,12 @@  int bdrv_has_zero_init_truncate(BlockDriverState *bs)
     }

     /*
-     * If the current layer is smaller than the backing layer,
-     * truncation may expose backing data; treat failure to query size
-     * in the same manner. Otherwise, we can trust the driver.
+     * If BS is a copy on write image, it is initialized to the
+     * contents of the base image, which may not be zeroes, so
+     * ZERO_CREATE is not viable.  If the current layer is smaller
+     * than the backing layer, truncation may expose backing data,
+     * restricting ZERO_TRUNCATE; treat failure to query size in the
+     * same manner.  Otherwise, we can trust the driver.
      */

     if (bs->backing) {
@@ -5125,12 +5110,14 @@  int bdrv_has_zero_init_truncate(BlockDriverState *bs)
         if (back < 0 || curr < back) {
             return 0;
         }
+        mask = BDRV_ZERO_TRUNCATE;
     }
-    if (bs->drv->bdrv_has_zero_init_truncate) {
-        return bs->drv->bdrv_has_zero_init_truncate(bs);
+
+    if (bs->drv->bdrv_known_zeroes) {
+        return bs->drv->bdrv_known_zeroes(bs) & mask;
     }
     if (bs->file && bs->drv->is_filter) {
-        return bdrv_has_zero_init_truncate(bs->file->bs);
+        return bdrv_known_zeroes(bs->file->bs) & mask;
     }

     /* safe default */
diff --git a/block/file-posix.c b/block/file-posix.c
index ab82ee1a6718..ff9e39ab882f 100644
--- a/block/file-posix.c
+++ b/block/file-posix.c
@@ -3071,8 +3071,7 @@  BlockDriver bdrv_file = {
     .bdrv_close = raw_close,
     .bdrv_co_create = raw_co_create,
     .bdrv_co_create_opts = raw_co_create_opts,
-    .bdrv_has_zero_init = bdrv_has_zero_init_1,
-    .bdrv_has_zero_init_truncate = bdrv_has_zero_init_1,
+    .bdrv_known_zeroes = bdrv_known_zeroes_truncate,
     .bdrv_co_block_status = raw_co_block_status,
     .bdrv_co_invalidate_cache = raw_co_invalidate_cache,
     .bdrv_co_pwrite_zeroes = raw_co_pwrite_zeroes,
diff --git a/block/file-win32.c b/block/file-win32.c
index 77e8ff7b68ae..e9b8f3b2370b 100644
--- a/block/file-win32.c
+++ b/block/file-win32.c
@@ -635,8 +635,7 @@  BlockDriver bdrv_file = {
     .bdrv_refresh_limits = raw_probe_alignment,
     .bdrv_close         = raw_close,
     .bdrv_co_create_opts = raw_co_create_opts,
-    .bdrv_has_zero_init = bdrv_has_zero_init_1,
-    .bdrv_has_zero_init_truncate = bdrv_has_zero_init_1,
+    .bdrv_known_zeroes  = bdrv_known_zeroes_truncate,

     .bdrv_aio_preadv    = raw_aio_preadv,
     .bdrv_aio_pwritev   = raw_aio_pwritev,
diff --git a/block/nfs.c b/block/nfs.c
index 9a6311e27066..34ebe91d5b39 100644
--- a/block/nfs.c
+++ b/block/nfs.c
@@ -702,10 +702,10 @@  out:
     return ret;
 }

-static int nfs_has_zero_init(BlockDriverState *bs)
+static int nfs_known_zeroes(BlockDriverState *bs)
 {
     NFSClient *client = bs->opaque;
-    return client->has_zero_init;
+    return client->has_zero_init ? BDRV_ZERO_CREATE | BDRV_ZERO_TRUNCATE : 0;
 }

 /* Called (via nfs_service) with QemuMutex held.  */
@@ -869,8 +869,7 @@  static BlockDriver bdrv_nfs = {
     .bdrv_parse_filename            = nfs_parse_filename,
     .create_opts                    = &nfs_create_opts,

-    .bdrv_has_zero_init             = nfs_has_zero_init,
-    .bdrv_has_zero_init_truncate    = nfs_has_zero_init,
+    .bdrv_known_zeroes              = nfs_known_zeroes,
     .bdrv_get_allocated_file_size   = nfs_get_allocated_file_size,
     .bdrv_co_truncate               = nfs_file_co_truncate,

diff --git a/block/parallels.c b/block/parallels.c
index 7a01997659b0..dad6389c8481 100644
--- a/block/parallels.c
+++ b/block/parallels.c
@@ -835,7 +835,7 @@  static int parallels_open(BlockDriverState *bs, QDict *options, int flags,
         goto fail_options;
     }

-    if (!bdrv_has_zero_init_truncate(bs->file->bs)) {
+    if (!(bdrv_known_zeroes(bs->file->bs) & BDRV_ZERO_TRUNCATE)) {
         s->prealloc_mode = PRL_PREALLOC_MODE_FALLOCATE;
     }

@@ -906,7 +906,7 @@  static BlockDriver bdrv_parallels = {
     .bdrv_close		= parallels_close,
     .bdrv_child_perm          = bdrv_format_default_perms,
     .bdrv_co_block_status     = parallels_co_block_status,
-    .bdrv_has_zero_init       = bdrv_has_zero_init_1,
+    .bdrv_known_zeroes        = bdrv_known_zeroes_create,
     .bdrv_co_flush_to_os      = parallels_co_flush_to_os,
     .bdrv_co_readv  = parallels_co_readv,
     .bdrv_co_writev = parallels_co_writev,
diff --git a/block/qcow.c b/block/qcow.c
index fce89898681f..b0c9e212fdb1 100644
--- a/block/qcow.c
+++ b/block/qcow.c
@@ -1183,7 +1183,7 @@  static BlockDriver bdrv_qcow = {
     .bdrv_reopen_prepare    = qcow_reopen_prepare,
     .bdrv_co_create         = qcow_co_create,
     .bdrv_co_create_opts    = qcow_co_create_opts,
-    .bdrv_has_zero_init     = bdrv_has_zero_init_1,
+    .bdrv_known_zeroes      = bdrv_known_zeroes_create,
     .supports_backing       = true,
     .bdrv_refresh_limits    = qcow_refresh_limits,

diff --git a/block/qcow2.c b/block/qcow2.c
index 40aa751d1de7..9f2371925737 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -4914,10 +4914,11 @@  static ImageInfoSpecific *qcow2_get_specific_info(BlockDriverState *bs,
     return spec_info;
 }

-static int qcow2_has_zero_init(BlockDriverState *bs)
+static int qcow2_known_zeroes(BlockDriverState *bs)
 {
     BDRVQcow2State *s = bs->opaque;
     bool preallocated;
+    int r = BDRV_ZERO_TRUNCATE;

     if (qemu_in_coroutine()) {
         qemu_co_mutex_lock(&s->lock);
@@ -4933,9 +4934,9 @@  static int qcow2_has_zero_init(BlockDriverState *bs)
     }

     if (!preallocated) {
-        return 1;
+        return r | BDRV_ZERO_CREATE;
     } else {
-        return bdrv_has_zero_init(s->data_file->bs);
+        return r | bdrv_known_zeroes(s->data_file->bs);
     }
 }

@@ -5559,8 +5560,7 @@  BlockDriver bdrv_qcow2 = {
     .bdrv_child_perm      = bdrv_format_default_perms,
     .bdrv_co_create_opts  = qcow2_co_create_opts,
     .bdrv_co_create       = qcow2_co_create,
-    .bdrv_has_zero_init   = qcow2_has_zero_init,
-    .bdrv_has_zero_init_truncate = bdrv_has_zero_init_1,
+    .bdrv_known_zeroes    = qcow2_known_zeroes,
     .bdrv_co_block_status = qcow2_co_block_status,

     .bdrv_co_preadv_part    = qcow2_co_preadv_part,
diff --git a/block/qed.c b/block/qed.c
index d8c4e5fb1e85..b00cef2035b3 100644
--- a/block/qed.c
+++ b/block/qed.c
@@ -1672,8 +1672,7 @@  static BlockDriver bdrv_qed = {
     .bdrv_child_perm          = bdrv_format_default_perms,
     .bdrv_co_create           = bdrv_qed_co_create,
     .bdrv_co_create_opts      = bdrv_qed_co_create_opts,
-    .bdrv_has_zero_init       = bdrv_has_zero_init_1,
-    .bdrv_has_zero_init_truncate = bdrv_has_zero_init_1,
+    .bdrv_known_zeroes        = bdrv_known_zeroes_truncate,
     .bdrv_co_block_status     = bdrv_qed_co_block_status,
     .bdrv_co_readv            = bdrv_qed_co_readv,
     .bdrv_co_writev           = bdrv_qed_co_writev,
diff --git a/block/raw-format.c b/block/raw-format.c
index 3a76ec7dd21b..1334a7a2c224 100644
--- a/block/raw-format.c
+++ b/block/raw-format.c
@@ -409,14 +409,9 @@  static int raw_co_ioctl(BlockDriverState *bs, unsigned long int req, void *buf)
     return bdrv_co_ioctl(bs->file->bs, req, buf);
 }

-static int raw_has_zero_init(BlockDriverState *bs)
+static int raw_known_zeroes(BlockDriverState *bs)
 {
-    return bdrv_has_zero_init(bs->file->bs);
-}
-
-static int raw_has_zero_init_truncate(BlockDriverState *bs)
-{
-    return bdrv_has_zero_init_truncate(bs->file->bs);
+    return bdrv_known_zeroes(bs->file->bs);
 }

 static int coroutine_fn raw_co_create_opts(const char *filename, QemuOpts *opts,
@@ -577,8 +572,7 @@  BlockDriver bdrv_raw = {
     .bdrv_lock_medium     = &raw_lock_medium,
     .bdrv_co_ioctl        = &raw_co_ioctl,
     .create_opts          = &raw_create_opts,
-    .bdrv_has_zero_init   = &raw_has_zero_init,
-    .bdrv_has_zero_init_truncate = &raw_has_zero_init_truncate,
+    .bdrv_known_zeroes    = &raw_known_zeroes,
     .strong_runtime_opts  = raw_strong_runtime_opts,
     .mutable_opts         = mutable_opts,
 };
diff --git a/block/rbd.c b/block/rbd.c
index 027cbcc69520..6cd8e86bccec 100644
--- a/block/rbd.c
+++ b/block/rbd.c
@@ -1289,8 +1289,7 @@  static BlockDriver bdrv_rbd = {
     .bdrv_reopen_prepare    = qemu_rbd_reopen_prepare,
     .bdrv_co_create         = qemu_rbd_co_create,
     .bdrv_co_create_opts    = qemu_rbd_co_create_opts,
-    .bdrv_has_zero_init     = bdrv_has_zero_init_1,
-    .bdrv_has_zero_init_truncate = bdrv_has_zero_init_1,
+    .bdrv_known_zeroes      = bdrv_known_zeroes_truncate,
     .bdrv_get_info          = qemu_rbd_getinfo,
     .create_opts            = &qemu_rbd_create_opts,
     .bdrv_getlength         = qemu_rbd_getlength,
diff --git a/block/sheepdog.c b/block/sheepdog.c
index 522c16a93676..916e64abdd74 100644
--- a/block/sheepdog.c
+++ b/block/sheepdog.c
@@ -3229,8 +3229,7 @@  static BlockDriver bdrv_sheepdog = {
     .bdrv_close                   = sd_close,
     .bdrv_co_create               = sd_co_create,
     .bdrv_co_create_opts          = sd_co_create_opts,
-    .bdrv_has_zero_init           = bdrv_has_zero_init_1,
-    .bdrv_has_zero_init_truncate  = bdrv_has_zero_init_1,
+    .bdrv_known_zeroes            = bdrv_known_zeroes_truncate,
     .bdrv_getlength               = sd_getlength,
     .bdrv_get_allocated_file_size = sd_get_allocated_file_size,
     .bdrv_co_truncate             = sd_co_truncate,
@@ -3268,8 +3267,7 @@  static BlockDriver bdrv_sheepdog_tcp = {
     .bdrv_close                   = sd_close,
     .bdrv_co_create               = sd_co_create,
     .bdrv_co_create_opts          = sd_co_create_opts,
-    .bdrv_has_zero_init           = bdrv_has_zero_init_1,
-    .bdrv_has_zero_init_truncate  = bdrv_has_zero_init_1,
+    .bdrv_known_zeroes            = bdrv_known_zeroes_truncate,
     .bdrv_getlength               = sd_getlength,
     .bdrv_get_allocated_file_size = sd_get_allocated_file_size,
     .bdrv_co_truncate             = sd_co_truncate,
@@ -3307,8 +3305,7 @@  static BlockDriver bdrv_sheepdog_unix = {
     .bdrv_close                   = sd_close,
     .bdrv_co_create               = sd_co_create,
     .bdrv_co_create_opts          = sd_co_create_opts,
-    .bdrv_has_zero_init           = bdrv_has_zero_init_1,
-    .bdrv_has_zero_init_truncate  = bdrv_has_zero_init_1,
+    .bdrv_known_zeroes            = bdrv_known_zeroes_truncate,
     .bdrv_getlength               = sd_getlength,
     .bdrv_get_allocated_file_size = sd_get_allocated_file_size,
     .bdrv_co_truncate             = sd_co_truncate,
diff --git a/block/ssh.c b/block/ssh.c
index b4375cf7d2e5..e89dae39800c 100644
--- a/block/ssh.c
+++ b/block/ssh.c
@@ -1007,14 +1007,14 @@  static void ssh_close(BlockDriverState *bs)
     ssh_state_free(s);
 }

-static int ssh_has_zero_init(BlockDriverState *bs)
+static int ssh_known_zeroes(BlockDriverState *bs)
 {
     BDRVSSHState *s = bs->opaque;
     /* Assume false, unless we can positively prove it's true. */
     int has_zero_init = 0;

     if (s->attrs->type == SSH_FILEXFER_TYPE_REGULAR) {
-        has_zero_init = 1;
+        has_zero_init = BDRV_ZERO_CREATE | BDRV_ZERO_TRUNCATE;
     }

     return has_zero_init;
@@ -1390,8 +1390,7 @@  static BlockDriver bdrv_ssh = {
     .bdrv_co_create               = ssh_co_create,
     .bdrv_co_create_opts          = ssh_co_create_opts,
     .bdrv_close                   = ssh_close,
-    .bdrv_has_zero_init           = ssh_has_zero_init,
-    .bdrv_has_zero_init_truncate  = ssh_has_zero_init,
+    .bdrv_known_zeroes            = ssh_known_zeroes,
     .bdrv_co_readv                = ssh_co_readv,
     .bdrv_co_writev               = ssh_co_writev,
     .bdrv_getlength               = ssh_getlength,
diff --git a/block/vdi.c b/block/vdi.c
index 0142da723315..df8f62624ccf 100644
--- a/block/vdi.c
+++ b/block/vdi.c
@@ -989,14 +989,14 @@  static void vdi_close(BlockDriverState *bs)
     error_free(s->migration_blocker);
 }

-static int vdi_has_zero_init(BlockDriverState *bs)
+static int vdi_known_zeroes(BlockDriverState *bs)
 {
     BDRVVdiState *s = bs->opaque;

     if (s->header.image_type == VDI_TYPE_STATIC) {
-        return bdrv_has_zero_init(bs->file->bs);
+        return bdrv_known_zeroes(bs->file->bs) & BDRV_ZERO_CREATE;
     } else {
-        return 1;
+        return BDRV_ZERO_CREATE;
     }
 }

@@ -1040,7 +1040,7 @@  static BlockDriver bdrv_vdi = {
     .bdrv_child_perm          = bdrv_format_default_perms,
     .bdrv_co_create      = vdi_co_create,
     .bdrv_co_create_opts = vdi_co_create_opts,
-    .bdrv_has_zero_init  = vdi_has_zero_init,
+    .bdrv_known_zeroes   = vdi_known_zeroes,
     .bdrv_co_block_status = vdi_co_block_status,
     .bdrv_make_empty = vdi_make_empty,

diff --git a/block/vhdx.c b/block/vhdx.c
index f02d2611bef8..4e8320c1b855 100644
--- a/block/vhdx.c
+++ b/block/vhdx.c
@@ -1365,7 +1365,7 @@  static coroutine_fn int vhdx_co_writev(BlockDriverState *bs, int64_t sector_num,
                 /* Queue another write of zero buffers if the underlying file
                  * does not zero-fill on file extension */

-                if (bdrv_has_zero_init_truncate(bs->file->bs) == 0) {
+                if (!(bdrv_known_zeroes(bs->file->bs) & BDRV_ZERO_TRUNCATE)) {
                     use_zero_buffers = true;

                     /* zero fill the front, if any */
@@ -1720,8 +1720,8 @@  static int vhdx_create_bat(BlockBackend *blk, BDRVVHDXState *s,
     }

     if (type == VHDX_TYPE_FIXED ||
-                use_zero_blocks ||
-                bdrv_has_zero_init(blk_bs(blk)) == 0) {
+        use_zero_blocks ||
+        !(bdrv_known_zeroes(blk_bs(blk)) & BDRV_ZERO_CREATE)) {
         /* for a fixed file, the default BAT entry is not zero */
         s->bat = g_try_malloc0(length);
         if (length && s->bat == NULL) {
@@ -2162,7 +2162,7 @@  static int coroutine_fn vhdx_co_check(BlockDriverState *bs,
     return 0;
 }

-static int vhdx_has_zero_init(BlockDriverState *bs)
+static int vhdx_known_zeroes(BlockDriverState *bs)
 {
     BDRVVHDXState *s = bs->opaque;
     int state;
@@ -2173,17 +2173,17 @@  static int vhdx_has_zero_init(BlockDriverState *bs)
      * therefore enough to check the first BAT entry.
      */
     if (!s->bat_entries) {
-        return 1;
+        return BDRV_ZERO_CREATE;
     }

     state = s->bat[0] & VHDX_BAT_STATE_BIT_MASK;
     if (state == PAYLOAD_BLOCK_FULLY_PRESENT) {
         /* Fixed subformat */
-        return bdrv_has_zero_init(bs->file->bs);
+        return bdrv_known_zeroes(bs->file->bs) & BDRV_ZERO_CREATE;
     }

     /* Dynamic subformat */
-    return 1;
+    return BDRV_ZERO_CREATE;
 }

 static QemuOptsList vhdx_create_opts = {
@@ -2239,7 +2239,7 @@  static BlockDriver bdrv_vhdx = {
     .bdrv_co_create_opts    = vhdx_co_create_opts,
     .bdrv_get_info          = vhdx_get_info,
     .bdrv_co_check          = vhdx_co_check,
-    .bdrv_has_zero_init     = vhdx_has_zero_init,
+    .bdrv_known_zeroes      = vhdx_known_zeroes,

     .create_opts            = &vhdx_create_opts,
 };
diff --git a/block/vmdk.c b/block/vmdk.c
index 20e909d99794..ca59f50413d2 100644
--- a/block/vmdk.c
+++ b/block/vmdk.c
@@ -2815,7 +2815,7 @@  static int64_t vmdk_get_allocated_file_size(BlockDriverState *bs)
     return ret;
 }

-static int vmdk_has_zero_init(BlockDriverState *bs)
+static int vmdk_known_zeroes(BlockDriverState *bs)
 {
     int i;
     BDRVVmdkState *s = bs->opaque;
@@ -2824,12 +2824,13 @@  static int vmdk_has_zero_init(BlockDriverState *bs)
      * return 0. */
     for (i = 0; i < s->num_extents; i++) {
         if (s->extents[i].flat) {
-            if (!bdrv_has_zero_init(s->extents[i].file->bs)) {
+            if (!(bdrv_known_zeroes(s->extents[i].file->bs) &
+                  BDRV_ZERO_CREATE)) {
                 return 0;
             }
         }
     }
-    return 1;
+    return BDRV_ZERO_CREATE;
 }

 static ImageInfo *vmdk_get_extent_info(VmdkExtent *extent)
@@ -3052,7 +3053,7 @@  static BlockDriver bdrv_vmdk = {
     .bdrv_co_flush_to_disk        = vmdk_co_flush,
     .bdrv_co_block_status         = vmdk_co_block_status,
     .bdrv_get_allocated_file_size = vmdk_get_allocated_file_size,
-    .bdrv_has_zero_init           = vmdk_has_zero_init,
+    .bdrv_known_zeroes            = vmdk_known_zeroes,
     .bdrv_get_specific_info       = vmdk_get_specific_info,
     .bdrv_refresh_limits          = vmdk_refresh_limits,
     .bdrv_get_info                = vmdk_get_info,
diff --git a/block/vpc.c b/block/vpc.c
index a65550298e19..f4741e07bfb2 100644
--- a/block/vpc.c
+++ b/block/vpc.c
@@ -1173,15 +1173,15 @@  fail:
 }


-static int vpc_has_zero_init(BlockDriverState *bs)
+static int vpc_known_zeroes(BlockDriverState *bs)
 {
     BDRVVPCState *s = bs->opaque;
     VHDFooter *footer =  (VHDFooter *) s->footer_buf;

     if (be32_to_cpu(footer->type) == VHD_FIXED) {
-        return bdrv_has_zero_init(bs->file->bs);
+        return bdrv_known_zeroes(bs->file->bs) & BDRV_ZERO_CREATE;
     } else {
-        return 1;
+        return BDRV_ZERO_CREATE;
     }
 }

@@ -1249,7 +1249,7 @@  static BlockDriver bdrv_vpc = {
     .bdrv_get_info          = vpc_get_info,

     .create_opts            = &vpc_create_opts,
-    .bdrv_has_zero_init     = vpc_has_zero_init,
+    .bdrv_known_zeroes      = vpc_known_zeroes,
     .strong_runtime_opts    = vpc_strong_runtime_opts,
 };

diff --git a/blockdev.c b/blockdev.c
index c6a727cca99d..90a17e7f7bce 100644
--- a/blockdev.c
+++ b/blockdev.c
@@ -4001,7 +4001,7 @@  void qmp_drive_mirror(DriveMirror *arg, Error **errp)

     zero_target = (arg->sync == MIRROR_SYNC_MODE_FULL &&
                    (arg->mode == NEW_IMAGE_MODE_EXISTING ||
-                    !bdrv_has_zero_init(target_bs)));
+                    !(bdrv_known_zeroes(target_bs) & BDRV_ZERO_CREATE)));


     /* Honor bdrv_try_set_aio_context() context acquisition requirements. */
diff --git a/include/block/block.h b/include/block/block.h
index 6cd566324d95..a6a227f50678 100644
--- a/include/block/block.h
+++ b/include/block/block.h
@@ -85,6 +85,28 @@  typedef enum {
     BDRV_REQ_MASK               = 0x3ff,
 } BdrvRequestFlags;

+typedef enum {
+    /*
+     * bdrv_known_zeroes() should include this bit if the contents of
+     * a freshly-created image with no backing file reads as all
+     * zeroes without any additional effort.  If .bdrv_co_truncate is
+     * set, then this must be clear if BDRV_ZERO_TRUNCATE is clear.
+     * Since this bit is only reliable at image creation, a driver may
+     * return this bit even for existing images that do not currently
+     * read as zero.
+     */
+    BDRV_ZERO_CREATE        = 0x1,
+
+    /*
+     * bdrv_known_zeroes() should include this bit if growing an image
+     * with PREALLOC_MODE_OFF (either with no backing file, or beyond
+     * the size of the backing file) will read the new data as all
+     * zeroes without any additional effort.  This bit only matters
+     * for drivers that set .bdrv_co_truncate.
+     */
+    BDRV_ZERO_TRUNCATE      = 0x2,
+} BdrvZeroFlags;
+
 typedef struct BlockSizes {
     uint32_t phys;
     uint32_t log;
@@ -430,9 +452,9 @@  void bdrv_drain_all(void);

 int bdrv_pdiscard(BdrvChild *child, int64_t offset, int64_t bytes);
 int bdrv_co_pdiscard(BdrvChild *child, int64_t offset, int64_t bytes);
-int bdrv_has_zero_init_1(BlockDriverState *bs);
-int bdrv_has_zero_init(BlockDriverState *bs);
-int bdrv_has_zero_init_truncate(BlockDriverState *bs);
+int bdrv_known_zeroes_create(BlockDriverState *bs);
+int bdrv_known_zeroes_truncate(BlockDriverState *bs);
+int bdrv_known_zeroes(BlockDriverState *bs);
 bool bdrv_unallocated_blocks_are_zero(BlockDriverState *bs);
 bool bdrv_can_write_zeroes_with_unmap(BlockDriverState *bs);
 int bdrv_block_status(BlockDriverState *bs, int64_t offset,
diff --git a/include/block/block_int.h b/include/block/block_int.h
index 77ab45dc87cf..47b34860bf95 100644
--- a/include/block/block_int.h
+++ b/include/block/block_int.h
@@ -441,19 +441,8 @@  struct BlockDriver {

     void (*bdrv_refresh_limits)(BlockDriverState *bs, Error **errp);

-    /*
-     * Returns 1 if newly created images are guaranteed to contain only
-     * zeros, 0 otherwise.
-     * Must return 0 if .bdrv_co_truncate is set and
-     * .bdrv_has_zero_init_truncate() returns 0.
-     */
-    int (*bdrv_has_zero_init)(BlockDriverState *bs);
-
-    /*
-     * Returns 1 if new areas added by growing the image with
-     * PREALLOC_MODE_OFF contain only zeros, 0 otherwise.
-     */
-    int (*bdrv_has_zero_init_truncate)(BlockDriverState *bs);
+    /* Returns bitwise-OR of BdrvZeroFlags. */
+    int (*bdrv_known_zeroes)(BlockDriverState *bs);

     /* Remove fd handlers, timers, and other event loop callbacks so the event
      * loop is no longer in use.  Called with no in-flight requests and in
diff --git a/qemu-img.c b/qemu-img.c
index e0bfc33ef4f6..e60217e6c382 100644
--- a/qemu-img.c
+++ b/qemu-img.c
@@ -1987,7 +1987,8 @@  static int convert_do_copy(ImgConvertState *s)
     /* Check whether we have zero initialisation or can get it efficiently */
     if (!s->has_zero_init && s->target_is_new && s->min_sparse &&
         !s->target_has_backing) {
-        s->has_zero_init = bdrv_has_zero_init(blk_bs(s->target));
+        s->has_zero_init = !!(bdrv_known_zeroes(blk_bs(s->target)) &
+                              BDRV_ZERO_CREATE);
     }

     if (!s->has_zero_init && !s->target_has_backing &&
diff --git a/tests/qemu-iotests/122 b/tests/qemu-iotests/122
index dfa350936fe6..7cb09309948f 100755
--- a/tests/qemu-iotests/122
+++ b/tests/qemu-iotests/122
@@ -267,7 +267,7 @@  echo
 # Keep source zero
 _make_test_img 64M

-# Output is not zero, but has bdrv_has_zero_init() == 1
+# Output is not zero, but has bdrv_known_zeroes() including BDRV_ZERO_CREATE
 TEST_IMG="$TEST_IMG".orig _make_test_img 64M
 $QEMU_IO -c "write -P 42 0 64k" "$TEST_IMG".orig | _filter_qemu_io

diff --git a/tests/qemu-iotests/188 b/tests/qemu-iotests/188
index afca44df5427..9656969fef4a 100755
--- a/tests/qemu-iotests/188
+++ b/tests/qemu-iotests/188
@@ -71,7 +71,7 @@  $QEMU_IO --object $SECRETALT -c "read -P 0xa 0 $size" --image-opts $IMGSPEC | _f
 _cleanup_test_img

 echo
-echo "== verify that has_zero_init returns false when preallocating =="
+echo "== verify that known_zeroes returns 0 when preallocating =="

 # Empty source file
 if [ -n "$TEST_IMG_FILE" ]; then
diff --git a/tests/qemu-iotests/188.out b/tests/qemu-iotests/188.out
index c568ef370145..f7da30440c65 100644
--- a/tests/qemu-iotests/188.out
+++ b/tests/qemu-iotests/188.out
@@ -16,7 +16,7 @@  read 16777216/16777216 bytes at offset 0
 == verify open failure with wrong password ==
 qemu-io: can't open: Invalid password, cannot unlock any keyslot

-== verify that has_zero_init returns false when preallocating ==
+== verify that known_zeroes returns 0 when preallocating ==
 Formatting 'TEST_DIR/t.IMGFMT.orig', fmt=IMGFMT size=16777216
 Images are identical.
 *** done