diff mbox series

[v6,10/10] qcow2: Forward ZERO_WRITE flag for full preallocation

Message ID 20200423150127.142609-11-kwolf@redhat.com (mailing list archive)
State New, archived
Headers show
Series block: Fix resize (extending) of short overlays | expand

Commit Message

Kevin Wolf April 23, 2020, 3:01 p.m. UTC
The BDRV_REQ_ZERO_WRITE is currently implemented in a way that first the
image is possibly preallocated and then the zero flag is added to all
clusters. This means that a copy-on-write operation may be needed when
writing to these clusters, despite having used preallocation, negating
one of the major benefits of preallocation.

Instead, try to forward the BDRV_REQ_ZERO_WRITE to the protocol driver,
and if the protocol driver can ensure that the new area reads as zeros,
we can skip setting the zero flag in the qcow2 layer.

Unfortunately, the same approach doesn't work for metadata
preallocation, so we'll still set the zero flag there.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
 block/qcow2.c              | 22 +++++++++++++++++++---
 tests/qemu-iotests/274.out |  4 ++--
 2 files changed, 21 insertions(+), 5 deletions(-)

Comments

Eric Blake April 23, 2020, 3:36 p.m. UTC | #1
On 4/23/20 10:01 AM, Kevin Wolf wrote:
> The BDRV_REQ_ZERO_WRITE is currently implemented in a way that first the
> image is possibly preallocated and then the zero flag is added to all
> clusters. This means that a copy-on-write operation may be needed when
> writing to these clusters, despite having used preallocation, negating
> one of the major benefits of preallocation.
> 
> Instead, try to forward the BDRV_REQ_ZERO_WRITE to the protocol driver,
> and if the protocol driver can ensure that the new area reads as zeros,
> we can skip setting the zero flag in the qcow2 layer.

Hmm.  When we get block status, it is very easy to tell that something 
reads as zero when the qcow2 file has the zero bit set, but when the 
qcow2 file does not have the zero bit set, we have to then query the 
format layer whether it reads as zeros (which is expensive enough for 
some format layers that we no longer report things as reading as zero). 
I'm worried that optimizing this case could penalize later block status.

We already can tell the difference between a cluster that has the zero 
bit set but is not preallocated, vs. has the zero bit set and is 
preallocated.  Are we really forcing a copy-on-write to a cluster that 
is marked zero but preallocated?  Is the problem that we don't have a 
way to distinguish between 'reads as zeroes, allocated, but we don't 
know state of format layer' and 'reads as zeroes, allocated, and we know 
format layer reads as zeroes'?

> 
> Unfortunately, the same approach doesn't work for metadata
> preallocation, so we'll still set the zero flag there.
> 
> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
> ---
>   block/qcow2.c              | 22 +++++++++++++++++++---
>   tests/qemu-iotests/274.out |  4 ++--
>   2 files changed, 21 insertions(+), 5 deletions(-)
> 
> diff --git a/block/qcow2.c b/block/qcow2.c
> index ad621fe404..b28e588942 100644
> --- a/block/qcow2.c
> +++ b/block/qcow2.c
> @@ -4170,9 +4170,25 @@ static int coroutine_fn qcow2_co_truncate(BlockDriverState *bs, int64_t offset,
>           /* Allocate the data area */
>           new_file_size = allocation_start +
>                           nb_new_data_clusters * s->cluster_size;
> -        /* Image file grows, so @exact does not matter */
> -        ret = bdrv_co_truncate(bs->file, new_file_size, false, prealloc, 0,
> -                               errp);
> +        /*
> +         * Image file grows, so @exact does not matter.
> +         *
> +         * If we need to zero out the new area, try first whether the protocol
> +         * driver can already take care of this.
> +         */
> +        if (flags & BDRV_REQ_ZERO_WRITE) {
> +            ret = bdrv_co_truncate(bs->file, new_file_size, false, prealloc,
> +                                   BDRV_REQ_ZERO_WRITE, errp);
> +            if (ret >= 0) {
> +                flags &= ~BDRV_REQ_ZERO_WRITE;
> +            }

Hmm - just noticing things: how does this series interplay with the 
existing bdrv_has_zero_init_truncate?  Should this series automatically 
handle BDRV_REQ_ZERO_WRITE on a bdrv_co_truncate(PREALLOC_NONE) request 
for all drivers that report true, even if that driver does not advertise 
support for the BDRV_REQ_ZERO_WRITE flag?

> +        } else {
> +            ret = -1;
> +        }

Here, ret == -1 does not imply whether errp is set - but annoyingly, 
errp CAN be set if bdrv_co_truncate() failed.

> +        if (ret < 0) {
> +            ret = bdrv_co_truncate(bs->file, new_file_size, false, prealloc, 0,
> +                                   errp);

And here, you are passing a possibly-set errp back to 
bdrv_co_truncate().  That is a bug that can abort.  You need to pass 
NULL to the first bdrv_co_truncate() call or else clear errp prior to 
trying a fallback to this second bdrv_co_truncate() call.
Kevin Wolf April 23, 2020, 4:04 p.m. UTC | #2
Am 23.04.2020 um 17:36 hat Eric Blake geschrieben:
> On 4/23/20 10:01 AM, Kevin Wolf wrote:
> > The BDRV_REQ_ZERO_WRITE is currently implemented in a way that first the
> > image is possibly preallocated and then the zero flag is added to all
> > clusters. This means that a copy-on-write operation may be needed when
> > writing to these clusters, despite having used preallocation, negating
> > one of the major benefits of preallocation.
> > 
> > Instead, try to forward the BDRV_REQ_ZERO_WRITE to the protocol driver,
> > and if the protocol driver can ensure that the new area reads as zeros,
> > we can skip setting the zero flag in the qcow2 layer.
> 
> Hmm.  When we get block status, it is very easy to tell that something reads
> as zero when the qcow2 file has the zero bit set, but when the qcow2 file
> does not have the zero bit set, we have to then query the format layer
> whether it reads as zeros (which is expensive enough for some format layers
> that we no longer report things as reading as zero). I'm worried that
> optimizing this case could penalize later block status.

That's just how preallocation works. If you don't want that, you need
preallocation=off.

> We already can tell the difference between a cluster that has the zero bit
> set but is not preallocated, vs. has the zero bit set and is preallocated.
> Are we really forcing a copy-on-write to a cluster that is marked zero but
> preallocated?  Is the problem that we don't have a way to distinguish
> between 'reads as zeroes, allocated, but we don't know state of format
> layer' and 'reads as zeroes, allocated, and we know format layer reads as
> zeroes'?

Basically, yes. If we had this, we could have a type of cluster where
writing to it still involves a metadata update (to clear the zero flag),
but never copy-on-write, even for partial writes.

I'm not sure if this would cover a very relevant case, though.

> > 
> > Unfortunately, the same approach doesn't work for metadata
> > preallocation, so we'll still set the zero flag there.
> > 
> > Signed-off-by: Kevin Wolf <kwolf@redhat.com>
> > ---
> >   block/qcow2.c              | 22 +++++++++++++++++++---
> >   tests/qemu-iotests/274.out |  4 ++--
> >   2 files changed, 21 insertions(+), 5 deletions(-)
> > 
> > diff --git a/block/qcow2.c b/block/qcow2.c
> > index ad621fe404..b28e588942 100644
> > --- a/block/qcow2.c
> > +++ b/block/qcow2.c
> > @@ -4170,9 +4170,25 @@ static int coroutine_fn qcow2_co_truncate(BlockDriverState *bs, int64_t offset,
> >           /* Allocate the data area */
> >           new_file_size = allocation_start +
> >                           nb_new_data_clusters * s->cluster_size;
> > -        /* Image file grows, so @exact does not matter */
> > -        ret = bdrv_co_truncate(bs->file, new_file_size, false, prealloc, 0,
> > -                               errp);
> > +        /*
> > +         * Image file grows, so @exact does not matter.
> > +         *
> > +         * If we need to zero out the new area, try first whether the protocol
> > +         * driver can already take care of this.
> > +         */
> > +        if (flags & BDRV_REQ_ZERO_WRITE) {
> > +            ret = bdrv_co_truncate(bs->file, new_file_size, false, prealloc,
> > +                                   BDRV_REQ_ZERO_WRITE, errp);
> > +            if (ret >= 0) {
> > +                flags &= ~BDRV_REQ_ZERO_WRITE;
> > +            }
> 
> Hmm - just noticing things: how does this series interplay with the existing
> bdrv_has_zero_init_truncate?  Should this series automatically handle
> BDRV_REQ_ZERO_WRITE on a bdrv_co_truncate(PREALLOC_NONE) request for all
> drivers that report true, even if that driver does not advertise support for
> the BDRV_REQ_ZERO_WRITE flag?

Hmm... It feels risky to me.

> > +        } else {
> > +            ret = -1;
> > +        }
> 
> Here, ret == -1 does not imply whether errp is set - but annoyingly, errp
> CAN be set if bdrv_co_truncate() failed.
> 
> > +        if (ret < 0) {
> > +            ret = bdrv_co_truncate(bs->file, new_file_size, false, prealloc, 0,
> > +                                   errp);
> 
> And here, you are passing a possibly-set errp back to bdrv_co_truncate().
> That is a bug that can abort.  You need to pass NULL to the first
> bdrv_co_truncate() call or else clear errp prior to trying a fallback to
> this second bdrv_co_truncate() call.

Yes, you're right. If nothing else comes up, I'll fix this while
applying.

Kevin
Eric Blake April 23, 2020, 4:15 p.m. UTC | #3
On 4/23/20 11:04 AM, Kevin Wolf wrote:

>> Hmm.  When we get block status, it is very easy to tell that something reads
>> as zero when the qcow2 file has the zero bit set, but when the qcow2 file
>> does not have the zero bit set, we have to then query the format layer
>> whether it reads as zeros (which is expensive enough for some format layers
>> that we no longer report things as reading as zero). I'm worried that
>> optimizing this case could penalize later block status.
> 
> That's just how preallocation works. If you don't want that, you need
> preallocation=off.

Good point. And if I recall, didn't we already have a discussion (or 
even patches) to optimize whether querying the format layer during block 
status was even worth the effort, depending on heuristics of the size of 
the format layer which in turn is based on whether there was 
preallocation?  So not a show-stopper.

> 
>> We already can tell the difference between a cluster that has the zero bit
>> set but is not preallocated, vs. has the zero bit set and is preallocated.
>> Are we really forcing a copy-on-write to a cluster that is marked zero but
>> preallocated?  Is the problem that we don't have a way to distinguish
>> between 'reads as zeroes, allocated, but we don't know state of format
>> layer' and 'reads as zeroes, allocated, and we know format layer reads as
>> zeroes'?
> 
> Basically, yes. If we had this, we could have a type of cluster where
> writing to it still involves a metadata update (to clear the zero flag),
> but never copy-on-write, even for partial writes.
> 
> I'm not sure if this would cover a very relevant case, though.

I also wonder if Berto's subcluster patches might play into this scenario.

>>
>> Hmm - just noticing things: how does this series interplay with the existing
>> bdrv_has_zero_init_truncate?  Should this series automatically handle
>> BDRV_REQ_ZERO_WRITE on a bdrv_co_truncate(PREALLOC_NONE) request for all
>> drivers that report true, even if that driver does not advertise support for
>> the BDRV_REQ_ZERO_WRITE flag?
> 
> Hmm... It feels risky to me.

Or worded differently, is bdrv_has_zero_init_truncate even still 
necessary (when it is documented only to cover the PREALLOC_NONE case), 
or should we get rid of it in favor of using BDRV_REQ_ZERO_WRITE 
everywhere instead?  (Which in turn involves visiting all drivers that 
previously advertised bdrv_has_zero_init_truncate... but I already have 
work in my tree trying to do that as part of preparing to add an 
autoclear bit to qcow2 to make it faster to report when a qcow2 image is 
known all-zero content...)

Looks like I'll be rebasing my work on top of this series.

> 
>>> +        } else {
>>> +            ret = -1;
>>> +        }
>>
>> Here, ret == -1 does not imply whether errp is set - but annoyingly, errp
>> CAN be set if bdrv_co_truncate() failed.
>>
>>> +        if (ret < 0) {
>>> +            ret = bdrv_co_truncate(bs->file, new_file_size, false, prealloc, 0,
>>> +                                   errp);
>>
>> And here, you are passing a possibly-set errp back to bdrv_co_truncate().
>> That is a bug that can abort.  You need to pass NULL to the first
>> bdrv_co_truncate() call or else clear errp prior to trying a fallback to
>> this second bdrv_co_truncate() call.
> 
> Yes, you're right. If nothing else comes up, I'll fix this while
> applying.
> 
> Kevin
>
Max Reitz April 23, 2020, 6:05 p.m. UTC | #4
On 23.04.20 17:01, Kevin Wolf wrote:
> The BDRV_REQ_ZERO_WRITE is currently implemented in a way that first the
> image is possibly preallocated and then the zero flag is added to all
> clusters. This means that a copy-on-write operation may be needed when
> writing to these clusters, despite having used preallocation, negating
> one of the major benefits of preallocation.
> 
> Instead, try to forward the BDRV_REQ_ZERO_WRITE to the protocol driver,
> and if the protocol driver can ensure that the new area reads as zeros,
> we can skip setting the zero flag in the qcow2 layer.
> 
> Unfortunately, the same approach doesn't work for metadata
> preallocation, so we'll still set the zero flag there.
> 
> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
> ---
>  block/qcow2.c              | 22 +++++++++++++++++++---
>  tests/qemu-iotests/274.out |  4 ++--
>  2 files changed, 21 insertions(+), 5 deletions(-)

Oh, nice.  I didn’t think you (or anyone else for that matter) would
actually do this. :)

With the errp thing fixed:

Reviewed-by: Max Reitz <mreitz@redhat.com>
diff mbox series

Patch

diff --git a/block/qcow2.c b/block/qcow2.c
index ad621fe404..b28e588942 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -4170,9 +4170,25 @@  static int coroutine_fn qcow2_co_truncate(BlockDriverState *bs, int64_t offset,
         /* Allocate the data area */
         new_file_size = allocation_start +
                         nb_new_data_clusters * s->cluster_size;
-        /* Image file grows, so @exact does not matter */
-        ret = bdrv_co_truncate(bs->file, new_file_size, false, prealloc, 0,
-                               errp);
+        /*
+         * Image file grows, so @exact does not matter.
+         *
+         * If we need to zero out the new area, try first whether the protocol
+         * driver can already take care of this.
+         */
+        if (flags & BDRV_REQ_ZERO_WRITE) {
+            ret = bdrv_co_truncate(bs->file, new_file_size, false, prealloc,
+                                   BDRV_REQ_ZERO_WRITE, errp);
+            if (ret >= 0) {
+                flags &= ~BDRV_REQ_ZERO_WRITE;
+            }
+        } else {
+            ret = -1;
+        }
+        if (ret < 0) {
+            ret = bdrv_co_truncate(bs->file, new_file_size, false, prealloc, 0,
+                                   errp);
+        }
         if (ret < 0) {
             error_prepend(errp, "Failed to resize underlying file: ");
             qcow2_free_clusters(bs, allocation_start,
diff --git a/tests/qemu-iotests/274.out b/tests/qemu-iotests/274.out
index 179bd7ccaf..c355b52689 100644
--- a/tests/qemu-iotests/274.out
+++ b/tests/qemu-iotests/274.out
@@ -179,7 +179,7 @@  read 65536/65536 bytes at offset 9437184
 10 MiB (0xa00000) bytes     allocated at offset 5 MiB (0x500000)
 
 [{ "start": 0, "length": 5242880, "depth": 1, "zero": true, "data": false},
-{ "start": 5242880, "length": 10485760, "depth": 0, "zero": true, "data": false, "offset": 327680}]
+{ "start": 5242880, "length": 10485760, "depth": 0, "zero": false, "data": true, "offset": 327680}]
 
 === preallocation=full ===
 Formatting 'TEST_DIR/PID-base', fmt=qcow2 size=16777216 cluster_size=65536 lazy_refcounts=off refcount_bits=16
@@ -198,7 +198,7 @@  read 65536/65536 bytes at offset 11534336
 4 MiB (0x400000) bytes     allocated at offset 8 MiB (0x800000)
 
 [{ "start": 0, "length": 8388608, "depth": 1, "zero": true, "data": false},
-{ "start": 8388608, "length": 4194304, "depth": 0, "zero": true, "data": false, "offset": 327680}]
+{ "start": 8388608, "length": 4194304, "depth": 0, "zero": false, "data": true, "offset": 327680}]
 
 === preallocation=off ===
 Formatting 'TEST_DIR/PID-base', fmt=qcow2 size=393216 cluster_size=65536 lazy_refcounts=off refcount_bits=16