[v9,01/13] qcow2: Unallocate unmapped zero clusters if no backing file

Message ID 20170411011718.9152-2-eblake@redhat.com (mailing list archive)
State New, archived

Commit Message

Eric Blake April 11, 2017, 1:17 a.m. UTC
'qemu-img map' already coalesces information about unallocated
clusters (those with status 0) and pure zero clusters (those
with status BDRV_BLOCK_ZERO and no offset).  Furthermore, all
qcow2 images with no backing file already report all unallocated
clusters (in the preallocation sense of clusters with no offset)
as BDRV_BLOCK_ZERO, regardless of whether the QCOW_OFLAG_ZERO was
set in that L2 entry (QCOW_OFLAG_ZERO also implies a return of
BDRV_BLOCK_ALLOCATED, but we intentionally do not expose that bit
to external users), thanks to generic block layer code in
bdrv_co_get_block_status().
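
For concreteness, that generic fallback is roughly the following (a
paraphrased sketch, not the exact block/io.c code):

    /* Paraphrased sketch: after the driver's own get_block_status result
     * comes back, the generic layer upgrades "neither DATA nor ZERO" to
     * BDRV_BLOCK_ZERO when the format promises that unallocated clusters
     * read as zeroes, which qcow2 does whenever it has no backing file. */
    if (!(ret & BDRV_BLOCK_DATA) && !(ret & BDRV_BLOCK_ZERO)) {
        if (bdrv_unallocated_blocks_are_zero(bs)) {
            ret |= BDRV_BLOCK_ZERO;
        }
    }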

So, for an image with no backing file, having bdrv_pwrite_zeroes
mark clusters as unallocated (defer to backing file) rather than
reads-as-zero (regardless of backing file) makes no difference
to normal behavior, but may potentially allow for fewer writes to
the L2 table of a freshly-created image where the L2 table is
initially written to all-zeroes (although I don't actually know
if we skip an L2 update and flush when re-writing the same
contents as already present).

Furthermore, this matches the behavior of discard_single_l2(), in
favoring an unallocated cluster over a zero cluster when full
discard is requested.

Meanwhile, version 2 qcow2 files (compat=0.10) lack support for an
explicit zero cluster.  This minor tweak therefore allows us to turn
write zeroes with unmap into an actual unallocation on those files,
where they used to return -ENOTSUP and cause an allocation due to
the fallback to explicitly written zeroes.
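
The fallback in question is roughly the following logic in the generic
write-zeroes path (again a paraphrased sketch, not the exact code):

    /* Paraphrased sketch: if the driver cannot zero the range itself, the
     * block layer emulates the request with a buffer of literal zeroes,
     * which on qcow2 means allocating data clusters. */
    ret = -ENOTSUP;
    if (drv->bdrv_co_pwrite_zeroes) {
        ret = drv->bdrv_co_pwrite_zeroes(bs, offset, num, flags);
    }
    if (ret == -ENOTSUP) {
        struct iovec iov = {
            .iov_base = qemu_blockalign0(bs, num),  /* zero-filled buffer */
            .iov_len  = num,
        };
        QEMUIOVector qiov;
        qemu_iovec_init_external(&qiov, &iov, 1);
        ret = bdrv_driver_pwritev(bs, offset, num, &qiov, 0);
        qemu_vfree(iov.iov_base);
    }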

Note that technically, we _could_ write a cluster as unallocated
rather than zero if a backing file exists but the backing file
also reads as zero, but that's more expensive to determine, so
this optimization is limited to qcow2 without a backing file.

Also note that this patch unmaps a compressed cluster when there
is no backing file, even when BDRV_REQ_MAY_UNMAP was not set, but
it is unlikely to have compressed clusters in a fully preallocated
image (the point of compression is to reduce space requirements),
so the side effect of unmapping a cluster in that case is not
deemed to be a problem.

Finally, note that this change can create a subtle difference if a
backing file is later added with 'qemu-img rebase -u'; pre-patch,
a v3 file (compat=1.1) will have a cluster that still reads as 0
(the cluster is not allocated in the sense of preallocation, but
is provided by this layer), while post-patch the cluster will
now defer to the backing file - but it's already an unsupported
action if you are modifying guest-visible data by messing with
backing chains ;)

Signed-off-by: Eric Blake <eblake@redhat.com>

---
v9: new patch
---
 block/qcow2-cluster.c | 10 +++++++---
 1 file changed, 7 insertions(+), 3 deletions(-)

Comments

Kevin Wolf April 12, 2017, 9:49 a.m. UTC | #1
Am 11.04.2017 um 03:17 hat Eric Blake geschrieben:
> 'qemu-img map' already coalesces information about unallocated
> clusters (those with status 0) and pure zero clusters (those
> with status BDRV_BLOCK_ZERO and no offset).  Furthermore, all
> qcow2 images with no backing file already report all unallocated
> clusters (in the preallocation sense of clusters with no offset)
> as BDRV_BLOCK_ZERO, regardless of whether the QCOW_OFLAG_ZERO was
> set in that L2 entry (QCOW_OFLAG_ZERO also implies a return of
> BDRV_BLOCK_ALLOCATED, but we intentionally do not expose that bit
> to external users), thanks to generic block layer code in
> bdrv_co_get_block_status().
> 
> So, for an image with no backing file, having bdrv_pwrite_zeroes
> mark clusters as unallocated (defer to backing file) rather than
> reads-as-zero (regardless of backing file) makes no difference
> to normal behavior, but may potentially allow for fewer writes to
> the L2 table of a freshly-created image where the L2 table is
> initially written to all-zeroes (although I don't actually know
> if we skip an L2 update and flush when re-writing the same
> contents as already present).

I don't get this. Allocating a cluster always involves an L2 update, no
matter whether it was previously unallocated or a zero cluster.

> Furthermore, this matches the behavior of discard_single_l2(), in
> favoring an unallocated cluster over a zero cluster when full
> discard is requested.

The only use for "full discard" is qcow2_make_empty(). It explicitly
requests that the backing file becomes visible again. This is a
completely different case.

In other words, in order to stay consistent between discard and
write_zeroes from a guest POV, we need to leave this code alone.

> Meanwhile, version 2 qcow2 files (compat=0.10) lack support for an
> explicit zero cluster.  This minor tweak therefore allows us to turn
> write zeroes with unmap into an actual unallocation on those files,
> where they used to return -ENOTSUP and cause an allocation due to
> the fallback to explicitly written zeroes.

Okay, this is true.

But I doubt that making write_zeroes more efficient on v2 images without
a backing file is really worth any extra complexity at this point...

Kevin
Eric Blake April 12, 2017, 1:32 p.m. UTC | #2
On 04/12/2017 04:49 AM, Kevin Wolf wrote:
> Am 11.04.2017 um 03:17 hat Eric Blake geschrieben:
>> 'qemu-img map' already coalesces information about unallocated
>> clusters (those with status 0) and pure zero clusters (those
>> with status BDRV_BLOCK_ZERO and no offset).  Furthermore, all
>> qcow2 images with no backing file already report all unallocated
>> clusters (in the preallocation sense of clusters with no offset)
>> as BDRV_BLOCK_ZERO, regardless of whether the QCOW_OFLAG_ZERO was
>> set in that L2 entry (QCOW_OFLAG_ZERO also implies a return of
>> BDRV_BLOCK_ALLOCATED, but we intentionally do not expose that bit
>> to external users), thanks to generic block layer code in
>> bdrv_co_get_block_status().
>>
>> So, for an image with no backing file, having bdrv_pwrite_zeroes
>> mark clusters as unallocated (defer to backing file) rather than
>> reads-as-zero (regardless of backing file) makes no difference
>> to normal behavior, but may potentially allow for fewer writes to
>> the L2 table of a freshly-created image where the L2 table is
>> initially written to all-zeroes (although I don't actually know
>> if we skip an L2 update and flush when re-writing the same
>> contents as already present).
> 
> I don't get this. Allocating a cluster always involves an L2 update, no
> matter whether it was previously unallocated or a zero cluster.

But we're NOT allocating a cluster.  Either we are unmapping a cluster,
or we are changing a pure unallocated cluster (bdrv_get_block_status
of 0) to a read-as-zero but unallocated cluster (status 1) [different
from the 'bdrv_is_allocated' sense of an allocated cluster being any
cluster determined by the current layer instead of by its backing file -
I really wish we wouldn't overload 'allocated' to two different
meanings, but am not sure which of the two meanings would be better to
rename, and to what].  My argument is that for a brand new image with no
backing file, all the clusters are pure unallocated, and that a write
zero with unmapping is doing more work by writing 0 => 1 than it would
by leaving 0 => 0, with identical results.
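
Spelling out the shorthand: QCOW_OFLAG_ZERO is bit 0 of an L2 entry, so
the three states look roughly like this (illustration only; cluster_offset
stands in for the host offset of a preallocated data cluster):

    uint64_t unallocated   = 0;                                /* "0": defer to backing file   */
    uint64_t zero_plain    = QCOW_OFLAG_ZERO;                  /* "1": reads as zero, no data  */
    uint64_t zero_prealloc = cluster_offset | QCOW_OFLAG_ZERO; /* "1 | offset": zero, prealloc kept */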


> 
>> Furthermore, this matches the behavior of discard_single_l2(), in
>> favoring an unallocated cluster over a zero cluster when full
>> discard is requested.
> 
> The only use for "full discard" is qcow2_make_empty(). It explicitly
> requests that the backing file becomes visible again. This is a
> completely different case.
> 
> In other words, in order to stay consistent between discard and
> write_zeroes from a guest POV, we need to leave this code alone.

I already argued that from a guest POV, the code is identical.  The only
situation that I changed is the situation where there is no backing
file, so the guest is guaranteed to read zeros from the cluster whether
the cluster is marked pure unallocated (0) or as reads-as-zero
unallocated (1).  The guest already requested unmapping, so they don't
need the preallocated-reads-as-zero (1 | offset).

Note that I absolutely agree that we must NOT make this change when
there is a backing file - a guest request to write zeroes MUST use
reads-as-zero (1) or preallocated reads-as-zero (1 | offset), and not
pure unallocated, unless we can guarantee that falling back to the
backing file will also read as zeroes, but taking the time to learn the
backing file status is not worth the expense.

> 
>> Meanwhile, version 2 qcow2 files (compat=0.10) lack support for an
>> explicit zero cluster.  This minor tweak therefore allows us to turn
>> write zeroes with unmap into an actual unallocation on those files,
>> where they used to return -ENOTSUP and cause an allocation due to
>> the fallback to explicitly written zeroes.
> 
> Okay, this is true.
> 
> But I doubt that making write_zeroes more efficient on v2 images without
> a backing file is really worth any extra complexity at this point...

Is it just my commit message that is wrong?  The code itself is fairly
short to review (I spent a lot more time on the commit message than on
the code - and it looks like I still need more time on the commit
message).  And later patches in the series DO depend on this, so I'm not
ready to just drop the patch yet.
Eric Blake April 21, 2017, 6:42 p.m. UTC | #3
On 04/12/2017 04:49 AM, Kevin Wolf wrote:
>> Furthermore, this matches the behavior of discard_single_l2(), in
>> favoring an unallocated cluster over a zero cluster when full
>> discard is requested.
> 
> The only use for "full discard" is qcow2_make_empty(). It explicitly
> requests that the backing file becomes visible again. This is a
> completely different case.

No, let's look at that code - I'm not referring to the
qcow2_make_empty() case, but to the bdrv_pdiscard() case:


        /*
         * If full_discard is false, make sure that a discarded area reads back
         * as zeroes for v3 images (we cannot do it for v2 without actually
         * writing a zero-filled buffer). We can skip the operation if the
         * cluster is already marked as zero, or if it's unallocated and we
         * don't have a backing file.
         *
         * TODO We might want to use bdrv_get_block_status(bs) here, but we're
         * holding s->lock, so that doesn't work today.
         *
         * If full_discard is true, the sector should not read back as zeroes,
         * but rather fall through to the backing file.
         */
        switch (qcow2_get_cluster_type(old_l2_entry)) {
            case QCOW2_CLUSTER_UNALLOCATED:
                if (full_discard || !bs->backing) {
                    continue;
                }
                break;


That says: if our cluster is currently unallocated, and we have no
backing file, then LEAVE it unallocated (rather than marking it
QCOW2_CLUSTER_ZERO), even through bdrv_pdiscard.

Meanwhile, the next lines:

            case QCOW2_CLUSTER_ZERO:
                if (!full_discard) {
                    continue;
                }

state that if we are doing bdrv_pdiscard(), and the cluster already
reads as zeroes, then minimize the churn by not changing the cluster
(and that in turn means we do NOT lose a preallocation).

> 
> In other words, in order to stay consistent between discard and
> write_zeroes from a guest POV, we need to leave this code alone.

Again, my argument is that if we have no backing file, and the cluster
is previously unallocated, then leaving it unallocated is equivalent to
changing it to QCOW2_CLUSTER_ZERO.  There is no difference to the guest
POV by this change.

> 
>> Meanwhile, version 2 qcow2 files (compat=0.10) lack support for an
>> explicit zero cluster.  This minor tweak therefore allows us to turn
>> write zeroes with unmap into an actual unallocation on those files,
>> where they used to return -ENOTSUP and cause an allocation due to
>> the fallback to explicitly written zeroes.
> 
> Okay, this is true.
> 
> But I doubt that making write_zeroes more efficient on v2 images without
> a backing file is really worth any extra complexity at this point...

My argument is that it is not much complexity, and that making this
change coupled with the new qemu-iotest 179 in patch 2/13 (which fails
without this change) enabled me to test qemu-io -c 'w -z -u' (which
nothing else was testing) and ensure that I'm not regressing anything
when the rest of the series further tweaks things to be byte-based.
Eric Blake May 11, 2017, 2:56 p.m. UTC | #4
[revisiting this older patch version, even though the final version in
today's pull request changed somewhat from this approach]

On 04/12/2017 04:49 AM, Kevin Wolf wrote:
> Am 11.04.2017 um 03:17 hat Eric Blake geschrieben:
>> 'qemu-img map' already coalesces information about unallocated
>> clusters (those with status 0) and pure zero clusters (those
>> with status BDRV_BLOCK_ZERO and no offset).  Furthermore, all
>> qcow2 images with no backing file already report all unallocated
>> clusters (in the preallocation sense of clusters with no offset)
>> as BDRV_BLOCK_ZERO, regardless of whether the QCOW_OFLAG_ZERO was
>> set in that L2 entry (QCOW_OFLAG_ZERO also implies a return of
>> BDRV_BLOCK_ALLOCATED, but we intentionally do not expose that bit
>> to external users), thanks to generic block layer code in
>> bdrv_co_get_block_status().
>>
>> So, for an image with no backing file, having bdrv_pwrite_zeroes
>> mark clusters as unallocated (defer to backing file) rather than
>> reads-as-zero (regardless of backing file) makes no difference
>> to normal behavior, but may potentially allow for fewer writes to
>> the L2 table of a freshly-created image where the L2 table is
>> initially written to all-zeroes (although I don't actually know
>> if we skip an L2 update and flush when re-writing the same
>> contents as already present).
> 
> I don't get this. Allocating a cluster always involves an L2 update, no
> matter whether it was previously unallocated or a zero cluster.

On IRC, John, Kevin, and I were discussing the current situation with
libvirt NBD storage migration.  When libvirt creates a file on the
destination (rather than the user pre-creating it), it currently
defaults to 0.10 [v2] images, even if the source was a 1.1 image [v3]
(arguably something that could be improved in libvirt, but
https://bugzilla.redhat.com/show_bug.cgi?id=1371749 was closed as not a
bug).

Therefore, the use case of doing a mirror job to a v2 image, and having
that image become thick even though the source was thin, is happening
more than we'd like
(https://bugzilla.redhat.com/show_bug.cgi?id=1371749).  While Kevin had
a point that in the common case we ALWAYS want to turn an unallocated
cluster into a zero cluster (so that we don't have to audit whether all
callers are properly accounting for the case where a backing image is
added later), our conversation on IRC today conceded that we may want to
introduce a new BDRV_REQ_READ_ZERO_NOP (or some such name) that
particular callers can use to request that if a cluster already reads as
zeroes, the write zero request does NOT have to change it.  Normal guest
operations would not use the flag, but mirroring IS a case that would
use the flag, so that we can end up with thinner mirrors even to 0.10
images.
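
A sketch of what such a flag could look like on the qcow2 side (purely
hypothetical; BDRV_REQ_READ_ZERO_NOP is just the name floated above, and
the check would sit in a loop like zero_single_l2()'s):

    /* Hypothetical sketch: if the caller allows it and the cluster already
     * reads as zeroes (unallocated, no backing file), skip the L2 update. */
    if ((flags & BDRV_REQ_READ_ZERO_NOP) && !bs->backing &&
        qcow2_get_cluster_type(old_l2_entry) == QCOW2_CLUSTER_UNALLOCATED) {
        continue;
    }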

The other consideration is that on 0.10 images, even if we have to
allocate, right now our allocation is done by way of failing with
-ENOTSUP and falling back to the normal pwrite() of explicit zeroes.  It
may be worth teaching the qcow2 layer to explicitly handle write zeroes,
even on 0.10 images, by allocating a cluster (as needed) but then
telling bs->file to write zeroes (punching a hole as appropriate) so
that the file is still thin.  In fact, it matches the fact that we
already have code that, for a qcow2 cluster reporting
BDRV_BLOCK_DATA|BDRV_BLOCK_OFFSET_VALID, probes bs->file to see if
there is a hole there, at which point it can add BDRV_BLOCK_ZERO to the
bdrv_get_block_status result.
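
Sketched out, the idea is something like this (not a real patch;
allocate_data_cluster() is a hypothetical stand-in for the usual cluster
allocation done on the write path):

    /* Hypothetical sketch: even on compat=0.10, handle write zeroes by
     * allocating the data cluster as a normal write would, then zeroing it
     * at the protocol layer so the underlying file can stay sparse. */
    static int coroutine_fn sketch_pwrite_zeroes_v2(BlockDriverState *bs,
                                                    int64_t offset, int bytes,
                                                    BdrvRequestFlags flags)
    {
        uint64_t host_offset;
        int ret = allocate_data_cluster(bs, offset, bytes, &host_offset);

        if (ret < 0) {
            return ret;
        }
        return bdrv_co_pwrite_zeroes(bs->file, host_offset, bytes,
                                     BDRV_REQ_MAY_UNMAP);
    }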

I don't know which one of us will tackle patches along these lines, but
wanted to at least capture the gist of the IRC conversation in the
archives for a bit more permanence.
Eric Blake May 11, 2017, 3:18 p.m. UTC | #5
On 05/11/2017 09:56 AM, Eric Blake wrote:
> 
> On IRC, John, Kevin, and I were discussing the current situation with
> libvirt NBD storage migration.  When libvirt creates a file on the
> destination (rather than the user pre-creating it), it currently
> defaults to 0.10 [v2] images, even if the source was a 1.1 image [v3]
> (arguably something that could be improved in libvirt, but
> https://bugzilla.redhat.com/show_bug.cgi?id=1371749 was closed as not a
> bug).
> 
> Therefore, the use case of doing a mirror job to a v2 image, and having
> that image become thick even though the source was thin, is happening
> more than we'd like
> (https://bugzilla.redhat.com/show_bug.cgi?id=1371749).  While Kevin had

Bah, pasted the wrong link.  Correct one is:
https://bugzilla.redhat.com/show_bug.cgi?id=1219541
Max Reitz May 12, 2017, 4:06 p.m. UTC | #6
On 2017-05-11 16:56, Eric Blake wrote:
> [revisiting this older patch version, even though the final version in
> today's pull request changed somewhat from this approach]
> 
> On 04/12/2017 04:49 AM, Kevin Wolf wrote:
>> Am 11.04.2017 um 03:17 hat Eric Blake geschrieben:
>>> 'qemu-img map' already coalesces information about unallocated
>>> clusters (those with status 0) and pure zero clusters (those
>>> with status BDRV_BLOCK_ZERO and no offset).  Furthermore, all
>>> qcow2 images with no backing file already report all unallocated
>>> clusters (in the preallocation sense of clusters with no offset)
>>> as BDRV_BLOCK_ZERO, regardless of whether the QCOW_OFLAG_ZERO was
>>> set in that L2 entry (QCOW_OFLAG_ZERO also implies a return of
>>> BDRV_BLOCK_ALLOCATED, but we intentionally do not expose that bit
>>> to external users), thanks to generic block layer code in
>>> bdrv_co_get_block_status().
>>>
>>> So, for an image with no backing file, having bdrv_pwrite_zeroes
>>> mark clusters as unallocated (defer to backing file) rather than
>>> reads-as-zero (regardless of backing file) makes no difference
>>> to normal behavior, but may potentially allow for fewer writes to
>>> the L2 table of a freshly-created image where the L2 table is
>>> initially written to all-zeroes (although I don't actually know
>>> if we skip an L2 update and flush when re-writing the same
>>> contents as already present).
>>
>> I don't get this. Allocating a cluster always involves an L2 update, no
>> matter whether it was previously unallocated or a zero cluster.
> 
> On IRC, John, Kevin, and I were discussing the current situation with
> libvirt NBD storage migration.  When libvirt creates a file on the
> destination (rather than the user pre-creating it), it currently
> defaults to 0.10 [v2] images, even if the source was a 1.1 image [v3]
> (arguably something that could be improved in libvirt, but
> https://bugzilla.redhat.com/show_bug.cgi?id=1371749 was closed as not a
> bug).
> 
> Therefore, the use case of doing a mirror job to a v2 image, and having
> that image become thick even though the source was thin, is happening
> more than we'd like
> (https://bugzilla.redhat.com/show_bug.cgi?id=1371749).  While Kevin had
> a point that in the common case we ALWAYS want to turn an unallocated
> cluster into a zero cluster (so that we don't have to audit whether all
> callers are properly accounting for the case where a backing image is
> added later), our conversation on IRC today conceded that we may want to
> introduce a new BDRV_REQ_READ_ZERO_NOP (or some such name) that
> particular callers can use to request that if a cluster already reads as
> zeroes, the write zero request does NOT have to change it.  Normal guest
> operations would not use the flag, but mirroring IS a case that would
> use the flag, so that we can end up with thinner mirrors even to 0.10
> images.
> 
> The other consideration is that on 0.10 images, even if we have to
> allocate, right now our allocation is done by way of failing with
> -ENOTSUP and falling back to the normal pwrite() of explicit zeroes.  It
> may be worth teaching the qcow2 layer to explicitly handle write zeroes,
> even on 0.10 images, by allocating a cluster (as needed) but then
> telling bs->file to write zeroes (punching a hole as appropriate) so
> that the file is still thin.  In fact, it matches the fact that we
> already have code that probes whether a qcow2 cluster that reports
> BDRV_BLOCK_DATA|BDRV_BLOCK_OFFSET_VALID then probes bs->file to see if
> there is a hole there, at which point it can add BDRV_BLOCK_ZERO to the
> bdrv_get_block_status.
> 
> I don't know which one of us will tackle patches along these lines, but
> wanted to at least capture the gist of the IRC conversation in the
> archives for a bit more permanence.

Just short ideas:

(1) I do consider it a bug if v2 images are created. The BZ wasn't
closed for a very good reason, but because nobody answered this question:

> is this really a problem?

And apparently it is.

(2) There is a way of creating zero clusters on v2 images, but we've
never done it (yet): You create a single data cluster containing zeroes
and then you just point to it whenever you need a zero cluster (until
its refcount overflows, at which point you allocate another cluster).
Maybe that helps.
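
Sketch of the idea (shared_zero_cluster, cluster_refcount() and
alloc_zeroed_cluster() are made-up helpers; none of this exists today):

    /* Hypothetical sketch: reuse one zero-filled data cluster for every
     * "zero" L2 entry on a v2 image, and allocate a fresh one once its
     * refcount would overflow. */
    if (!s->shared_zero_cluster ||
        cluster_refcount(bs, s->shared_zero_cluster) == s->refcount_max) {
        s->shared_zero_cluster = alloc_zeroed_cluster(bs);
    }
    /* No QCOW_OFLAG_COPIED: with a refcount above 1, a guest write to this
     * cluster must trigger COW. */
    l2_table[l2_index] = cpu_to_be64(s->shared_zero_cluster);
    /* ...then bump the refcount of the shared cluster by one. */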

I'd be really careful with optimizations for v2 images. You have already
proven to me that my fears of too much complexity are out of proportion
sometimes, but still. v2 upstream is old legacy and should be treated as
such, at least IMHO.

Max
John Snow May 12, 2017, 11 p.m. UTC | #7
On 05/12/2017 12:06 PM, Max Reitz wrote:
> On 2017-05-11 16:56, Eric Blake wrote:
>> [revisiting this older patch version, even though the final version in
>> today's pull request changed somewhat from this approach]
>>
>> On 04/12/2017 04:49 AM, Kevin Wolf wrote:
>>> Am 11.04.2017 um 03:17 hat Eric Blake geschrieben:
>>>> 'qemu-img map' already coalesces information about unallocated
>>>> clusters (those with status 0) and pure zero clusters (those
>>>> with status BDRV_BLOCK_ZERO and no offset).  Furthermore, all
>>>> qcow2 images with no backing file already report all unallocated
>>>> clusters (in the preallocation sense of clusters with no offset)
>>>> as BDRV_BLOCK_ZERO, regardless of whether the QCOW_OFLAG_ZERO was
>>>> set in that L2 entry (QCOW_OFLAG_ZERO also implies a return of
>>>> BDRV_BLOCK_ALLOCATED, but we intentionally do not expose that bit
>>>> to external users), thanks to generic block layer code in
>>>> bdrv_co_get_block_status().
>>>>
>>>> So, for an image with no backing file, having bdrv_pwrite_zeroes
>>>> mark clusters as unallocated (defer to backing file) rather than
>>>> reads-as-zero (regardless of backing file) makes no difference
>>>> to normal behavior, but may potentially allow for fewer writes to
>>>> the L2 table of a freshly-created image where the L2 table is
>>>> initially written to all-zeroes (although I don't actually know
>>>> if we skip an L2 update and flush when re-writing the same
>>>> contents as already present).
>>>
>>> I don't get this. Allocating a cluster always involves an L2 update, no
>>> matter whether it was previously unallocated or a zero cluster.
>>
>> On IRC, John, Kevin, and I were discussing the current situation with
>> libvirt NBD storage migration.  When libvirt creates a file on the
>> destination (rather than the user pre-creating it), it currently
>> defaults to 0.10 [v2] images, even if the source was a 1.1 image [v3]
>> (arguably something that could be improved in libvirt, but
>> https://bugzilla.redhat.com/show_bug.cgi?id=1371749 was closed as not a
>> bug).
>>
>> Therefore, the use case of doing a mirror job to a v2 image, and having
>> that image become thick even though the source was thin, is happening
>> more than we'd like
>> (https://bugzilla.redhat.com/show_bug.cgi?id=1371749).  While Kevin had
>> a point that in the common case we ALWAYS want to turn an unallocated
>> cluster into a zero cluster (so that we don't have to audit whether all
>> callers are properly accounting for the case where a backing image is
>> added later), our conversation on IRC today conceded that we may want to
>> introduce a new BDRV_REQ_READ_ZERO_NOP (or some such name) that
>> particular callers can use to request that if a cluster already reads as
>> zeroes, the write zero request does NOT have to change it.  Normal guest
>> operations would not use the flag, but mirroring IS a case that would
>> use the flag, so that we can end up with thinner mirrors even to 0.10
>> images.
>>
>> The other consideration is that on 0.10 images, even if we have to
>> allocate, right now our allocation is done by way of failing with
>> -ENOTSUP and falling back to the normal pwrite() of explicit zeroes.  It
>> may be worth teaching the qcow2 layer to explicitly handle write zeroes,
>> even on 0.10 images, by allocating a cluster (as needed) but then
>> telling bs->file to write zeroes (punching a hole as appropriate) so
>> that the file is still thin.  In fact, it matches the fact that we
>> already have code that probes whether a qcow2 cluster that reports
>> BDRV_BLOCK_DATA|BDRV_BLOCK_OFFSET_VALID then probes bs->file to see if
>> there is a hole there, at which point it can add BDRV_BLOCK_ZERO to the
>> bdrv_get_block_status.
>>
>> I don't know which one of us will tackle patches along these lines, but
>> wanted to at least capture the gist of the IRC conversation in the
>> archives for a bit more permanence.
> 
> Just short ideas:
> 
> (1) I do consider it a bug if v2 images are created. The BZ wasn't
> closed for a very good reason, but because nobody answered this question:
> 
>> is this really a problem?
> 
> And apparently it is.
> 
> (2) There is a way of creating zero clusters on v2 images, but we've
> never done it (yet): You create a single data cluster containing zeroes
> and then you just point to it whenever you need a zero cluster (until
> its refcount overflows, at which point you allocate another cluster).
> Maybe that helps.
> 
> I'd be really careful with optimizations for v2 images. You have already
> proven to me that my fears of too much complexity are out of proportions
> sometimes, but still. v2 upstream is old legacy and should be treated as
> such, at least IMHO.
> 
> Max
> 
> 

I agree that V2 changes should be limited in nature, but there are
less-fiddly ways to have thin 0.10 images on sparse filesystems. This
way feels a little too clever and most likely too intrusive for an old
format.

I'd also think that it'd probably confuse older copies of qemu-img
check, so maybe this is too hacky of a solution.
Max Reitz May 15, 2017, 6:35 p.m. UTC | #8
On 2017-05-13 01:00, John Snow wrote:
> 
> 
> On 05/12/2017 12:06 PM, Max Reitz wrote:
>> On 2017-05-11 16:56, Eric Blake wrote:
>>> [revisiting this older patch version, even though the final version in
>>> today's pull request changed somewhat from this approach]
>>>
>>> On 04/12/2017 04:49 AM, Kevin Wolf wrote:
>>>> Am 11.04.2017 um 03:17 hat Eric Blake geschrieben:
>>>>> 'qemu-img map' already coalesces information about unallocated
>>>>> clusters (those with status 0) and pure zero clusters (those
>>>>> with status BDRV_BLOCK_ZERO and no offset).  Furthermore, all
>>>>> qcow2 images with no backing file already report all unallocated
>>>>> clusters (in the preallocation sense of clusters with no offset)
>>>>> as BDRV_BLOCK_ZERO, regardless of whether the QCOW_OFLAG_ZERO was
>>>>> set in that L2 entry (QCOW_OFLAG_ZERO also implies a return of
>>>>> BDRV_BLOCK_ALLOCATED, but we intentionally do not expose that bit
>>>>> to external users), thanks to generic block layer code in
>>>>> bdrv_co_get_block_status().
>>>>>
>>>>> So, for an image with no backing file, having bdrv_pwrite_zeroes
>>>>> mark clusters as unallocated (defer to backing file) rather than
>>>>> reads-as-zero (regardless of backing file) makes no difference
>>>>> to normal behavior, but may potentially allow for fewer writes to
>>>>> the L2 table of a freshly-created image where the L2 table is
>>>>> initially written to all-zeroes (although I don't actually know
>>>>> if we skip an L2 update and flush when re-writing the same
>>>>> contents as already present).
>>>>
>>>> I don't get this. Allocating a cluster always involves an L2 update, no
>>>> matter whether it was previously unallocated or a zero cluster.
>>>
>>> On IRC, John, Kevin, and I were discussing the current situation with
>>> libvirt NBD storage migration.  When libvirt creates a file on the
>>> destination (rather than the user pre-creating it), it currently
>>> defaults to 0.10 [v2] images, even if the source was a 1.1 image [v3]
>>> (arguably something that could be improved in libvirt, but
>>> https://bugzilla.redhat.com/show_bug.cgi?id=1371749 was closed as not a
>>> bug).
>>>
>>> Therefore, the use case of doing a mirror job to a v2 image, and having
>>> that image become thick even though the source was thin, is happening
>>> more than we'd like
>>> (https://bugzilla.redhat.com/show_bug.cgi?id=1371749).  While Kevin had
>>> a point that in the common case we ALWAYS want to turn an unallocated
>>> cluster into a zero cluster (so that we don't have to audit whether all
>>> callers are properly accounting for the case where a backing image is
>>> added later), our conversation on IRC today conceded that we may want to
>>> introduce a new BDRV_REQ_READ_ZERO_NOP (or some such name) that
>>> particular callers can use to request that if a cluster already reads as
>>> zeroes, the write zero request does NOT have to change it.  Normal guest
>>> operations would not use the flag, but mirroring IS a case that would
>>> use the flag, so that we can end up with thinner mirrors even to 0.10
>>> images.

Reading this again, you could just pass through the write-zero request.
The issue right now is that the qcow2 driver just returns -ENOTSUP for
every write-zero request on v2 images when it could just create a data
cluster and invoke a write-zero request on the data.

Just not suppressing the BDRV_REQ_ZERO_WRITE flag for the generic
fall-back zero write (with bs->supported_write_flags listing it) and
then issuing a bdrv_co_pwrite_zeroes() in qcow2_co_pwritev() works (with
BDRV_REQ_MAY_UNMAP).

(Whether or not to set BDRV_REQ_MAY_UNMAP should be derived from...
Maybe QCOW2_DISCARD_OTHER?)
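
Roughly, i.e. a sketch of the direction rather than a finished patch (the
variable names are only assumed to match qcow2_co_pwritev(), and
BDRV_REQ_MAY_UNMAP is hard-coded here just for the sketch):

    /* Sketch: once the data cluster is allocated, forward a request flagged
     * BDRV_REQ_ZERO_WRITE to the protocol layer as a zero write (punching a
     * hole where possible) instead of writing a zeroed buffer, so even
     * compat=0.10 images stay sparse. */
    if (flags & BDRV_REQ_ZERO_WRITE) {
        ret = bdrv_co_pwrite_zeroes(bs->file,
                                    cluster_offset + offset_in_cluster,
                                    cur_bytes, BDRV_REQ_MAY_UNMAP);
    } else {
        ret = bdrv_co_pwritev(bs->file,
                              cluster_offset + offset_in_cluster,
                              cur_bytes, &hd_qiov, 0);
    }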

>>> The other consideration is that on 0.10 images, even if we have to
>>> allocate, right now our allocation is done by way of failing with
>>> -ENOTSUP and falling back to the normal pwrite() of explicit zeroes.  It
>>> may be worth teaching the qcow2 layer to explicitly handle write zeroes,
>>> even on 0.10 images, by allocating a cluster (as needed) but then
>>> telling bs->file to write zeroes (punching a hole as appropriate) so
>>> that the file is still thin.  In fact, it matches the fact that we
>>> already have code that probes whether a qcow2 cluster that reports
>>> BDRV_BLOCK_DATA|BDRV_BLOCK_OFFSET_VALID then probes bs->file to see if
>>> there is a hole there, at which point it can add BDRV_BLOCK_ZERO to the
>>> bdrv_get_block_status.
>>>
>>> I don't know which one of us will tackle patches along these lines, but
>>> wanted to at least capture the gist of the IRC conversation in the
>>> archives for a bit more permanence.
>>
>> Just short ideas:
>>
>> (1) I do consider it a bug if v2 images are created. The BZ wasn't
>> closed for a very good reason, but because nobody answered this question:
>>
>>> is this really a problem?
>>
>> And apparently it is.
>>
>> (2) There is a way of creating zero clusters on v2 images, but we've
>> never done it (yet): You create a single data cluster containing zeroes
>> and then you just point to it whenever you need a zero cluster (until
>> its refcount overflows, at which point you allocate another cluster).
>> Maybe that helps.
>>
>> I'd be really careful with optimizations for v2 images. You have already
>> proven to me that my fears of too much complexity are out of proportions
>> sometimes, but still. v2 upstream is old legacy and should be treated as
>> such, at least IMHO.
>>
>> Max
>>
>>
> 
> I agree that V2 changes should be limited in nature, but there are
> less-fiddly ways to have thin 0.10 images on sparse filesystems. This
> way feels a little too clever and mostly likely too intrusive for an old
> format.

Yes.

> I'd also think that it'd probably confuse older copies of qemu-img
> check, so maybe this is too hacky of a solution.

I don't.

Anyway, feel free to convince me by writing a simple patch that allows
the v2 sparseness you need.

Let's again get to the root of the issue: Mirroring to a new v2 qcow2
image does not create a sparse image. There are multiple questions here:

(1) Who does mirroring?
(2) Who uses v2 qcow2 for the destination?
(3) Who needs sparse images?

(1) is clear, everybody does it. (3) is rather clear, it would be nice
to have (but not an absolute catastrophe if we didn't). So what remains
is (2).

From qemu's perspective, qcow2v3 has been the default for quite some
time. So just using qemu, nobody will use v2.

Maybe you're using an old qemu, maybe you're using RHEL 6's qemu which
only supports v2. This is an argument, but I'd very much just recommend
an update. (Bit of an internal question: Are you willing to backport the
upstream changes this would bring to RHEL 6?)

And then there's the case the BZ is about: Maybe you're using a
management tool which forces v2 images. I'd call it a bug in the
management tool because we didn't make v3 the default for nothing. Maybe
there are good reasons why some management tools create v2, but I don't
know any and when the BZ was closed none were given. It was just closed
because there was no apparent reason not to (but there is!).


So I see the following courses of action:

1. Ignore the v2 "issue". It's not an upstream issue, it's a bug in a
management tool, so let's just reopen the BZ and ignore RHEL 6 users.
Also, v2 just is a limited format, there is a reason v3 exists! So I
wouldn't even call it an issue, it's just how it is.

2. Improve v2. Would "fix" the issue at hand while still not letting
RHEL 7's libvirt create v3 images (which might very well come back to
haunt us some other day), and would complicate upstream code for no
upstream reason.

Now, as I said, I'm open to be convinced that (2) is viable: If the
change is simple enough, sure, let's take it, why not. But I don't see
the point still.

Max

Patch

diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c
index 100398c..12f44b2 100644
--- a/block/qcow2-cluster.c
+++ b/block/qcow2-cluster.c
@@ -1579,7 +1579,8 @@  static int zero_single_l2(BlockDriverState *bs, uint64_t offset,
         /* Update L2 entries */
         qcow2_cache_entry_mark_dirty(bs, s->l2_table_cache, l2_table);
         if (old_offset & QCOW_OFLAG_COMPRESSED || flags & BDRV_REQ_MAY_UNMAP) {
-            l2_table[l2_index + i] = cpu_to_be64(QCOW_OFLAG_ZERO);
+            l2_table[l2_index + i] = bs->backing
+                ? cpu_to_be64(QCOW_OFLAG_ZERO) : 0;
             qcow2_free_any_clusters(bs, old_offset, 1, QCOW2_DISCARD_REQUEST);
         } else {
             l2_table[l2_index + i] |= cpu_to_be64(QCOW_OFLAG_ZERO);
@@ -1598,8 +1599,11 @@  int qcow2_zero_clusters(BlockDriverState *bs, uint64_t offset, int nb_sectors,
     uint64_t nb_clusters;
     int ret;

-    /* The zero flag is only supported by version 3 and newer */
-    if (s->qcow_version < 3) {
+    /* The zero flag is only supported by version 3 and newer; we
+     * require the use of that flag if there is a backing file or if
+     * we are not allowed to unmap.  */
+    if (s->qcow_version < 3 &&
+        (bs->backing || !(flags & BDRV_REQ_MAY_UNMAP))) {
         return -ENOTSUP;
     }