diff mbox

[for,2.7,v3,1/1] qcow2: improve qcow2_co_write_zeroes()

Message ID 1462950014-1821-1-git-send-email-den@openvz.org (mailing list archive)
State New, archived
Headers show

Commit Message

Denis V. Lunev May 11, 2016, 7 a.m. UTC
qcow2_co_write_zeroes() can be called with a partial block. This can be
triggered synthetically with
    qemu-io -c "write -z 32k 4k"
and happens in real life with qemu-nbd. The latter occurs under the
following conditions:
    (1) qemu-nbd is started with --detect-zeroes=on and is connected to the
        kernel NBD client
    (2) a third-party program opens the kernel NBD device with O_DIRECT
    (3) the third-party program performs a write with a memory buffer that
        is not page-aligned
In this case qcow2_co_write_zeroes() cannot perform the operation by
marking the entire cluster as zeroed, and returns ENOTSUP. The caller
therefore falls back to the non-optimized path and writes real zeroes to
the disk.

The patch adds a shortcut. If the block reads as zeroes, e.g. because it
is unallocated, the request is extended to cover the full block. The
user-visible content of the block does not change. Before the patch the
block was filled in the image with real zeroes; after the patch it is
marked as zeroed in the metadata, so the block is not affected by any
subsequent changes in the backing store chain.

Kevin, thank you for a cool suggestion.

Signed-off-by: Denis V. Lunev <den@openvz.org>
Reviewed-by: Roman Kagan <rkagan@virtuozzo.com>
CC: Kevin Wolf <kwolf@redhat.com>
CC: Max Reitz <mreitz@redhat.com>
---
Changes from v2:
- checked head/tail clusters separately (one can be zeroed, one unallocated)
- fixed range calculations
- fixed race when the block can become used just after the check
- fixed zero cluster detection
- minor tweaks in the description

Changes from v1:
- description rewritten completely
- new approach suggested by Kevin is implemented

 block/qcow2.c | 65 +++++++++++++++++++++++++++++++++++++++++++++++++++++------
 1 file changed, 59 insertions(+), 6 deletions(-)
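
As a worked illustration of the alignment arithmetic in this patch (the
numbers are hypothetical: 64 KiB clusters and 512-byte sectors, so
cluster_sectors is 128, and a 4 KiB zero write at offset 62 KiB, the
two-cluster case discussed later in the thread):

    #include <stdint.h>

    /* Illustration only, not part of the patch: how a misaligned request is
     * widened to whole clusters before the zero-cluster checks are made. */
    static void example_expand_misaligned_request(void)
    {
        int cluster_sectors = 128;       /* 64 KiB / 512 */
        int64_t sector_num = 124;        /* 62 KiB / 512 */
        int nb_sectors = 8;              /*  4 KiB / 512 */

        int head = sector_num % cluster_sectors;                /* 124 */
        int tail = (sector_num + nb_sectors) % cluster_sectors; /*   4 */

        sector_num -= head;                        /* 0: aligned start      */
        nb_sectors += head;                        /* 132                   */
        if (tail != 0) {
            nb_sectors += cluster_sectors - tail;  /* 256: 2 whole clusters */
        }
        /* Both touched clusters must read as zeroes for the shortcut to
         * apply; otherwise qcow2_co_write_zeroes() still returns -ENOTSUP. */
    }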

Comments

Kevin Wolf May 11, 2016, 11:28 a.m. UTC | #1
On 11.05.2016 at 09:00, Denis V. Lunev wrote:
> There is a possibility that qcow2_co_write_zeroes() will be called
> with the partial block. This could be synthetically triggered with
>     qemu-io -c "write -z 32k 4k"
> and can happen in the real life in qemu-nbd. The latter happens under
> the following conditions:
>     (1) qemu-nbd is started with --detect-zeroes=on and is connected to the
>         kernel NBD client
>     (2) third party program opens kernel NBD device with O_DIRECT
>     (3) third party program performs write operation with memory buffer
>         not aligned to the page
> In this case qcow2_co_write_zeroes() is unable to perform the operation
> and mark entire cluster as zeroed and returns ENOTSUP. Thus the caller
> switches to non-optimized version and writes real zeroes to the disk.
> 
> The patch creates a shortcut. If the block is read as zeroes, f.e. if
> it is unallocated, the request is extended to cover full block.
> User-visible situation with this block is not changed. Before the patch
> the block is filled in the image with real zeroes. After that patch the
> block is marked as zeroed in metadata. Thus any subsequent changes in
> backing store chain are not affected.
> 
> Kevin, thank you for a cool suggestion.
> 
> Signed-off-by: Denis V. Lunev <den@openvz.org>
> Reviewed-by: Roman Kagan <rkagan@virtuozzo.com>
> CC: Kevin Wolf <kwolf@redhat.com>
> CC: Max Reitz <mreitz@redhat.com>
> ---
> Changes from v2:
> - checked head/tail clusters separately (one can be zeroed, one unallocated)
> - fixed range calculations
> - fixed race when the block can become used just after the check
> - fixed zero cluster detection
> - minor tweaks in the description
> 
> Changes from v1:
> - description rewritten completely
> - new approach suggested by Kevin is implemented
> 
>  block/qcow2.c | 65 +++++++++++++++++++++++++++++++++++++++++++++++++++++------
>  1 file changed, 59 insertions(+), 6 deletions(-)
> 
> diff --git a/block/qcow2.c b/block/qcow2.c
> index 470734b..c2474c1 100644
> --- a/block/qcow2.c
> +++ b/block/qcow2.c
> @@ -2411,21 +2411,74 @@ finish:
>      return ret;
>  }
>  
> +
> +static bool is_zero_cluster(BlockDriverState *bs, int64_t start)
> +{
> +    BDRVQcow2State *s = bs->opaque;
> +    int nr;
> +    BlockDriverState *file;
> +    int64_t res = bdrv_get_block_status_above(bs, NULL, start,
> +                                              s->cluster_sectors, &nr, &file);
> +    return res >= 0 && ((res & BDRV_BLOCK_ZERO) || !(res & BDRV_BLOCK_DATA));

Why did you add the !(res & BDRV_BLOCK_DATA) condition? This means that
all unallocated clusters return true, even if the backing file contains
non-zero data for them.

> +}
> +
> +static bool is_zero_cluster_top_locked(BlockDriverState *bs, int64_t start)
> +{
> +    BDRVQcow2State *s = bs->opaque;
> +    int nr = s->cluster_sectors;
> +    uint64_t off;
> +    int ret;
> +
> +    ret = qcow2_get_cluster_offset(bs, start << BDRV_SECTOR_BITS, &nr, &off);
> +    return ret == QCOW2_CLUSTER_UNALLOCATED || ret == QCOW2_CLUSTER_ZERO;
> +}

This part is new, but it also returns true for clusters that are
allocated in the backing file.

>  static coroutine_fn int qcow2_co_write_zeroes(BlockDriverState *bs,
>      int64_t sector_num, int nb_sectors, BdrvRequestFlags flags)
>  {
>      int ret;
>      BDRVQcow2State *s = bs->opaque;
>  
> -    /* Emulate misaligned zero writes */
> -    if (sector_num % s->cluster_sectors || nb_sectors % s->cluster_sectors) {
> -        return -ENOTSUP;
> +    int head = sector_num % s->cluster_sectors;
> +    int tail = (sector_num + nb_sectors) % s->cluster_sectors;
> +
> +    if (head != 0 || tail != 0) {
> +        int64_t cl_end = -1;
> +
> +        sector_num -= head;
> +        nb_sectors += head;
> +
> +        if (tail != 0) {
> +            nb_sectors += s->cluster_sectors - tail;
> +        }
> +
> +        if (!is_zero_cluster(bs, sector_num)) {
> +            return -ENOTSUP;
> +        }
> +
> +        if (nb_sectors > s->cluster_sectors) {
> +            /* Technically the request can cover 2 clusters, f.e. 4k write
> +               at s->cluster_sectors - 2k offset. One of these cluster can
> +               be zeroed, one unallocated */

This cannot happen. bdrv_co_write_zeroes() splits requests not only
based on the length, but so that they are actually aligned at cluster
boundaries.

> +            cl_end = sector_num + nb_sectors - s->cluster_sectors;
> +            if (!is_zero_cluster(bs, cl_end)) {
> +                return -ENOTSUP;
> +            }
> +        }
> +
> +        qemu_co_mutex_lock(&s->lock);
> +        /* We can have new write after previous check */
> +        if (!is_zero_cluster_top_locked(bs, sector_num) ||
> +                (cl_end > 0 && !is_zero_cluster_top_locked(bs, cl_end))) {
> +            qemu_co_mutex_unlock(&s->lock);
> +            return -ENOTSUP;
> +        }

Just lock the mutex before the check, the possible optimisation for the
emulation case (which is slow anyway) isn't worth the additional code
complexity.

> +    } else {
> +        qemu_co_mutex_lock(&s->lock);
>      }
>  
>      /* Whatever is left can use real zero clusters */
> -    qemu_co_mutex_lock(&s->lock);
> -    ret = qcow2_zero_clusters(bs, sector_num << BDRV_SECTOR_BITS,
> -        nb_sectors);
> +    ret = qcow2_zero_clusters(bs, sector_num << BDRV_SECTOR_BITS, nb_sectors);
>      qemu_co_mutex_unlock(&s->lock);

In the end, I think v2 with just the bugs fixed (those that I mentioned
in the v2 review, plus the locking problem that you found) would be
better than this version.

Kevin
Denis V. Lunev May 12, 2016, 9 a.m. UTC | #2
On 05/11/2016 02:28 PM, Kevin Wolf wrote:
> On 11.05.2016 at 09:00, Denis V. Lunev wrote:
>> There is a possibility that qcow2_co_write_zeroes() will be called
>> with the partial block. This could be synthetically triggered with
>>      qemu-io -c "write -z 32k 4k"
>> and can happen in the real life in qemu-nbd. The latter happens under
>> the following conditions:
>>      (1) qemu-nbd is started with --detect-zeroes=on and is connected to the
>>          kernel NBD client
>>      (2) third party program opens kernel NBD device with O_DIRECT
>>      (3) third party program performs write operation with memory buffer
>>          not aligned to the page
>> In this case qcow2_co_write_zeroes() is unable to perform the operation
>> and mark entire cluster as zeroed and returns ENOTSUP. Thus the caller
>> switches to non-optimized version and writes real zeroes to the disk.
>>
>> The patch creates a shortcut. If the block is read as zeroes, f.e. if
>> it is unallocated, the request is extended to cover full block.
>> User-visible situation with this block is not changed. Before the patch
>> the block is filled in the image with real zeroes. After that patch the
>> block is marked as zeroed in metadata. Thus any subsequent changes in
>> backing store chain are not affected.
>>
>> Kevin, thank you for a cool suggestion.
>>
>> Signed-off-by: Denis V. Lunev <den@openvz.org>
>> Reviewed-by: Roman Kagan <rkagan@virtuozzo.com>
>> CC: Kevin Wolf <kwolf@redhat.com>
>> CC: Max Reitz <mreitz@redhat.com>
>> ---
>> Changes from v2:
>> - checked head/tail clusters separately (one can be zeroed, one unallocated)
>> - fixed range calculations
>> - fixed race when the block can become used just after the check
>> - fixed zero cluster detection
>> - minor tweaks in the description
>>
>> Changes from v1:
>> - description rewritten completely
>> - new approach suggested by Kevin is implemented
>>
>>   block/qcow2.c | 65 +++++++++++++++++++++++++++++++++++++++++++++++++++++------
>>   1 file changed, 59 insertions(+), 6 deletions(-)
>>
>> diff --git a/block/qcow2.c b/block/qcow2.c
>> index 470734b..c2474c1 100644
>> --- a/block/qcow2.c
>> +++ b/block/qcow2.c
>> @@ -2411,21 +2411,74 @@ finish:
>>       return ret;
>>   }
>>   
>> +
>> +static bool is_zero_cluster(BlockDriverState *bs, int64_t start)
>> +{
>> +    BDRVQcow2State *s = bs->opaque;
>> +    int nr;
>> +    BlockDriverState *file;
>> +    int64_t res = bdrv_get_block_status_above(bs, NULL, start,
>> +                                              s->cluster_sectors, &nr, &file);
>> +    return res >= 0 && ((res & BDRV_BLOCK_ZERO) || !(res & BDRV_BLOCK_DATA));
> Why did you add the !(res & BDRV_BLOCK_DATA) condition? This means that
> all unallocated clusters return true, even if the backing file contains
> non-zero data for them.

This is correct. From my POV this means that this area is unallocated
in the entire backing chain and will thus read as zeroes, so we can
cover it with zeroes.

Indeed,

# create top image
hades ~/src/qemu $ qemu-img create -f qcow2 test2.qcow2 -b test.qcow2 64G
Formatting 'test2.qcow2', fmt=qcow2 size=68719476736 
backing_file='test.qcow2' encryption=off cluster_size=65536 
lazy_refcounts=off refcount_bits=16

# create backing store
hades ~/src/qemu $ qemu-img create -f qcow2 test.qcow2 64G
Formatting 'test.qcow2', fmt=qcow2 size=68719476736 encryption=off 
cluster_size=65536 lazy_refcounts=off refcount_bits=16

# write into the first 2 data clusters of the backing store
hades ~/src/qemu $ ./qemu-io -c "write -P 11 62k 4k" test.qcow2
qcow2_co_writev off=f800 bytes nr=1000 bytes
wrote 4096/4096 bytes at offset 63488
4 KiB, 1 ops; 0.1349 sec (29.642 KiB/sec and 7.4106 ops/sec)

# write zeroes into the first 2 data clusters of the top image
hades ~/src/qemu $ ./qemu-io -c "write -z 62k 4k" test2.qcow2
qcow2_co_write_zeroes off=0xf800 bytes nr=0x1000 bytes
is_zero_cluster off=0x0 bytes res=0x50015 ZERO=0 DATA=1
qcow2_co_writev off=f800 bytes nr=1000 bytes
wrote 4096/4096 bytes at offset 63488
4 KiB, 1 ops; 0.1371 sec (29.167 KiB/sec and 7.2919 ops/sec)
# so far, so good. No zeroes, data is reported

# writing to the top image at offset 254 KiB (unallocated in both images)
hades ~/src/qemu $ ./qemu-io -c "write -z 254k 4k" test2.qcow2
qcow2_co_write_zeroes off=0x3f800 bytes nr=0x1000 bytes
is_zero_cluster off=0x30000 bytes res=0x2 ZERO=1 DATA=0
is_zero_cluster off=0x40000 bytes res=0x2 ZERO=1 DATA=0
is_zero_cluster_top_locked off=0x30000 bytes ret=0x0
is_zero_cluster_top_locked off=0x40000 bytes ret=0x0
wrote 4096/4096 bytes at offset 260096
4 KiB, 1 ops; 0.0297 sec (134.391 KiB/sec and 33.5976 ops/sec)
hades ~/src/qemu $
# you are correct: zeroes are reported, but data is not reported either.
# maybe my condition means the same thing in both cases.

OK. Can be cleaned up.



>> +}
>> +
>> +static bool is_zero_cluster_top_locked(BlockDriverState *bs, int64_t start)
>> +{
>> +    BDRVQcow2State *s = bs->opaque;
>> +    int nr = s->cluster_sectors;
>> +    uint64_t off;
>> +    int ret;
>> +
>> +    ret = qcow2_get_cluster_offset(bs, start << BDRV_SECTOR_BITS, &nr, &off);
>> +    return ret == QCOW2_CLUSTER_UNALLOCATED || ret == QCOW2_CLUSTER_ZERO;
>> +}
> This part is new, but it also returns true for clusters that are
> allocated in the backing file.

This is checked only after the first check, i.e. when is_zero_cluster()
has already returned success. I assume here that backing stores are
read-only and cannot change. Otherwise I would have to call
     bdrv_get_block_status_above(backing_bs(bs), NULL, ...
with s->lock held, which I wanted to avoid.

>>   static coroutine_fn int qcow2_co_write_zeroes(BlockDriverState *bs,
>>       int64_t sector_num, int nb_sectors, BdrvRequestFlags flags)
>>   {
>>       int ret;
>>       BDRVQcow2State *s = bs->opaque;
>>   
>> -    /* Emulate misaligned zero writes */
>> -    if (sector_num % s->cluster_sectors || nb_sectors % s->cluster_sectors) {
>> -        return -ENOTSUP;
>> +    int head = sector_num % s->cluster_sectors;
>> +    int tail = (sector_num + nb_sectors) % s->cluster_sectors;
>> +
>> +    if (head != 0 || tail != 0) {
>> +        int64_t cl_end = -1;
>> +
>> +        sector_num -= head;
>> +        nb_sectors += head;
>> +
>> +        if (tail != 0) {
>> +            nb_sectors += s->cluster_sectors - tail;
>> +        }
>> +
>> +        if (!is_zero_cluster(bs, sector_num)) {
>> +            return -ENOTSUP;
>> +        }
>> +
>> +        if (nb_sectors > s->cluster_sectors) {
>> +            /* Technically the request can cover 2 clusters, f.e. 4k write
>> +               at s->cluster_sectors - 2k offset. One of these cluster can
>> +               be zeroed, one unallocated */
> This cannot happen. bdrv_co_write_zeroes() splits requests not only
> based on the length, but so that they are actually aligned at cluster
> boundaries.
This does happen.

hades ~/src/qemu $ ./qemu-io -c "write -z 62k 4k" test.qcow2
qcow2_co_write_zeroes off=0xf800 bytes nr=0x1000 bytes
wrote 4096/4096 bytes at offset 63488
4 KiB, 1 ops; 0.0670 sec (59.662 KiB/sec and 14.9156 ops/sec)
hades ~/src/qemu $ ./qemu-img info test.qcow2
image: test.qcow2
file format: qcow2
virtual size: 64G (68719476736 bytes)
disk size: 260K
cluster_size: 65536
Format specific information:
     compat: 1.1
     lazy refcounts: false
     refcount bits: 16
     corrupt: false
hades ~/src/qemu $

Does this mean that I should check the upper layer and fix it there?


>> +            cl_end = sector_num + nb_sectors - s->cluster_sectors;
>> +            if (!is_zero_cluster(bs, cl_end)) {
>> +                return -ENOTSUP;
>> +            }
>> +        }
>> +
>> +        qemu_co_mutex_lock(&s->lock);
>> +        /* We can have new write after previous check */
>> +        if (!is_zero_cluster_top_locked(bs, sector_num) ||
>> +                (cl_end > 0 && !is_zero_cluster_top_locked(bs, cl_end))) {
>> +            qemu_co_mutex_unlock(&s->lock);
>> +            return -ENOTSUP;
>> +        }
> Just lock the mutex before the check, the possible optimisation for the
> emulation case (which is slow anyway) isn't worth the additional code
> complexity.

bdrv_get_block_status_above(bs) takes s->lock internally. The lock is not
recursive, so the code would hang. That is the problem this split of
checks is trying to address.

Maybe we could make the lock recursive...
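
For reference, a comment-only sketch of the deadlock described here (the
call chain is an assumption based on how bdrv_get_block_status_above()
ends up in the qcow2 driver):

    /*
     * qcow2_co_write_zeroes()
     *     qemu_co_mutex_lock(&s->lock)              <- lock taken up front
     *     is_zero_cluster()
     *         bdrv_get_block_status_above(bs, ...)
     *             qcow2_co_get_block_status()
     *                 qemu_co_mutex_lock(&s->lock)  <- same CoMutex, same
     *                                                  coroutine: hangs
     */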



>> +    } else {
>> +        qemu_co_mutex_lock(&s->lock);
>>       }
>>   
>>       /* Whatever is left can use real zero clusters */
>> -    qemu_co_mutex_lock(&s->lock);
>> -    ret = qcow2_zero_clusters(bs, sector_num << BDRV_SECTOR_BITS,
>> -        nb_sectors);
>> +    ret = qcow2_zero_clusters(bs, sector_num << BDRV_SECTOR_BITS, nb_sectors);
>>       qemu_co_mutex_unlock(&s->lock);
> In the end, I think v2 with just the bugs fixed (those that I mentioned
> in the v2 review, plus the locking problem that you found) would be
> better than this version.
>
> Kevin

Den
Kevin Wolf May 12, 2016, 10:37 a.m. UTC | #3
On 12.05.2016 at 11:00, Denis V. Lunev wrote:
> On 05/11/2016 02:28 PM, Kevin Wolf wrote:
> >On 11.05.2016 at 09:00, Denis V. Lunev wrote:
> >>There is a possibility that qcow2_co_write_zeroes() will be called
> >>with the partial block. This could be synthetically triggered with
> >>     qemu-io -c "write -z 32k 4k"
> >>and can happen in the real life in qemu-nbd. The latter happens under
> >>the following conditions:
> >>     (1) qemu-nbd is started with --detect-zeroes=on and is connected to the
> >>         kernel NBD client
> >>     (2) third party program opens kernel NBD device with O_DIRECT
> >>     (3) third party program performs write operation with memory buffer
> >>         not aligned to the page
> >>In this case qcow2_co_write_zeroes() is unable to perform the operation
> >>and mark entire cluster as zeroed and returns ENOTSUP. Thus the caller
> >>switches to non-optimized version and writes real zeroes to the disk.
> >>
> >>The patch creates a shortcut. If the block is read as zeroes, f.e. if
> >>it is unallocated, the request is extended to cover full block.
> >>User-visible situation with this block is not changed. Before the patch
> >>the block is filled in the image with real zeroes. After that patch the
> >>block is marked as zeroed in metadata. Thus any subsequent changes in
> >>backing store chain are not affected.
> >>
> >>Kevin, thank you for a cool suggestion.
> >>
> >>Signed-off-by: Denis V. Lunev <den@openvz.org>
> >>Reviewed-by: Roman Kagan <rkagan@virtuozzo.com>
> >>CC: Kevin Wolf <kwolf@redhat.com>
> >>CC: Max Reitz <mreitz@redhat.com>
> >>---
> >>Changes from v2:
> >>- checked head/tail clusters separately (one can be zeroed, one unallocated)
> >>- fixed range calculations
> >>- fixed race when the block can become used just after the check
> >>- fixed zero cluster detection
> >>- minor tweaks in the description
> >>
> >>Changes from v1:
> >>- description rewritten completely
> >>- new approach suggested by Kevin is implemented
> >>
> >>  block/qcow2.c | 65 +++++++++++++++++++++++++++++++++++++++++++++++++++++------
> >>  1 file changed, 59 insertions(+), 6 deletions(-)
> >>
> >>diff --git a/block/qcow2.c b/block/qcow2.c
> >>index 470734b..c2474c1 100644
> >>--- a/block/qcow2.c
> >>+++ b/block/qcow2.c
> >>@@ -2411,21 +2411,74 @@ finish:
> >>      return ret;
> >>  }
> >>+
> >>+static bool is_zero_cluster(BlockDriverState *bs, int64_t start)
> >>+{
> >>+    BDRVQcow2State *s = bs->opaque;
> >>+    int nr;
> >>+    BlockDriverState *file;
> >>+    int64_t res = bdrv_get_block_status_above(bs, NULL, start,
> >>+                                              s->cluster_sectors, &nr, &file);
> >>+    return res >= 0 && ((res & BDRV_BLOCK_ZERO) || !(res & BDRV_BLOCK_DATA));
> >Why did you add the !(res & BDRV_BLOCK_DATA) condition? This means that
> >all unallocated clusters return true, even if the backing file contains
> >non-zero data for them.
> 
> this is correct. From my POW this means that this area is unallocated
> in the entire backing chain and thus it will be read as zeroes. Thus
> we could cover it with zeroes.

You're right that I made a mistake, I was thinking of the non-recursive
bdrv_get_block_status().

However, I still think that we may not assume that !BDRV_BLOCK_DATA
means zero data, even though that affects only more obscure cases. We
have bdrv_unallocated_blocks_are_zero() to check whether the assumption
is true. However, bdrv_co_get_block_status() already checks this
internally and sets BDRV_BLOCK_ZERO in this case, so just checking
BDRV_BLOCK_ZERO in qcow2 should be good.

Did you find a case where you got !DATA, but not ZERO, and assuming
zeroes was valid? If so, we may need to fix bdrv_co_get_block_status().
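
For concreteness, a minimal sketch of the cleanup suggested above (not the
committed fix; it keeps the helper's current signature and simply checks
BDRV_BLOCK_ZERO only):

    static bool is_zero_cluster(BlockDriverState *bs, int64_t start)
    {
        BDRVQcow2State *s = bs->opaque;
        int nr;
        BlockDriverState *file;
        int64_t res = bdrv_get_block_status_above(bs, NULL, start,
                                                  s->cluster_sectors, &nr, &file);

        /* Rely on bdrv_co_get_block_status() having already turned
         * "unallocated and unallocated blocks read as zero" into
         * BDRV_BLOCK_ZERO, instead of treating every !BDRV_BLOCK_DATA
         * cluster as zero. */
        return res >= 0 && (res & BDRV_BLOCK_ZERO);
    }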

> Indeed,
> 
> # create top image
> hades ~/src/qemu $ qemu-img create -f qcow2 test2.qcow2 -b test.qcow2 64G
> Formatting 'test2.qcow2', fmt=qcow2 size=68719476736
> backing_file='test.qcow2' encryption=off cluster_size=65536
> lazy_refcounts=off refcount_bits=16
> 
> # create backing store
> hades ~/src/qemu $ qemu-img create -f qcow2 test.qcow2 64G
> Formatting 'test.qcow2', fmt=qcow2 size=68719476736 encryption=off
> cluster_size=65536 lazy_refcounts=off refcount_bits=16
> 
> # write to 2 first data clusters into backing store
> hades ~/src/qemu $ ./qemu-io -c "write -P 11 62k 4k" test.qcow2
> qcow2_co_writev off=f800 bytes nr=1000 bytes
> wrote 4096/4096 bytes at offset 63488
> 4 KiB, 1 ops; 0.1349 sec (29.642 KiB/sec and 7.4106 ops/sec)
> 
> # write zeroes to 2 first data clusters into top image
> hades ~/src/qemu $ ./qemu-io -c "write -z 62k 4k" test2.qcow2
> qcow2_co_write_zeroes off=0xf800 bytes nr=0x1000 bytes
> is_zero_cluster off=0x0 bytes res=0x50015 ZERO=0 DATA=1
> qcow2_co_writev off=f800 bytes nr=1000 bytes
> wrote 4096/4096 bytes at offset 63488
> 4 KiB, 1 ops; 0.1371 sec (29.167 KiB/sec and 7.2919 ops/sec)
> # so far, so good. No zeroes, data is reported
> 
> #writing to top image at offset 252 Kb (unallocated in both images)
> hades ~/src/qemu $ ./qemu-io -c "write -z 254k 4k" test2.qcow2
> qcow2_co_write_zeroes off=0x3f800 bytes nr=0x1000 bytes
> is_zero_cluster off=0x30000 bytes res=0x2 ZERO=1 DATA=0
> is_zero_cluster off=0x40000 bytes res=0x2 ZERO=1 DATA=0
> is_zero_cluster_top_locked off=0x30000 bytes ret=0x0
> is_zero_cluster_top_locked off=0x40000 bytes ret=0x0
> wrote 4096/4096 bytes at offset 260096
> 4 KiB, 1 ops; 0.0297 sec (134.391 KiB/sec and 33.5976 ops/sec)
> hades ~/src/qemu $
> # you correct, zeroes reported, but data is not reported too.
> # may be my condition means the same in both parts.
> 
> OK. Can be cleaned up.

I think it means the same in all cases where
bdrv_unallocated_blocks_are_zero(bs) == true.

> >>+}
> >>+
> >>+static bool is_zero_cluster_top_locked(BlockDriverState *bs, int64_t start)
> >>+{
> >>+    BDRVQcow2State *s = bs->opaque;
> >>+    int nr = s->cluster_sectors;
> >>+    uint64_t off;
> >>+    int ret;
> >>+
> >>+    ret = qcow2_get_cluster_offset(bs, start << BDRV_SECTOR_BITS, &nr, &off);
> >>+    return ret == QCOW2_CLUSTER_UNALLOCATED || ret == QCOW2_CLUSTER_ZERO;
> >>+}
> >This part is new, but it also returns true for clusters that are
> >allocated in the backing file.
> 
> this is checked only after the first check, i.e. when
> is_zero_cluster has already
> returned success. I assume here that backing stores are read-only
> and can not
> be changed. In the other case I'll have to call
>     bdrv_get_block_status_above(backing_bs(bs), NULL,
> with s->lock held which I have not wanted to do.
> 
> >>  static coroutine_fn int qcow2_co_write_zeroes(BlockDriverState *bs,
> >>      int64_t sector_num, int nb_sectors, BdrvRequestFlags flags)
> >>  {
> >>      int ret;
> >>      BDRVQcow2State *s = bs->opaque;
> >>-    /* Emulate misaligned zero writes */
> >>-    if (sector_num % s->cluster_sectors || nb_sectors % s->cluster_sectors) {
> >>-        return -ENOTSUP;
> >>+    int head = sector_num % s->cluster_sectors;
> >>+    int tail = (sector_num + nb_sectors) % s->cluster_sectors;
> >>+
> >>+    if (head != 0 || tail != 0) {
> >>+        int64_t cl_end = -1;
> >>+
> >>+        sector_num -= head;
> >>+        nb_sectors += head;
> >>+
> >>+        if (tail != 0) {
> >>+            nb_sectors += s->cluster_sectors - tail;
> >>+        }
> >>+
> >>+        if (!is_zero_cluster(bs, sector_num)) {
> >>+            return -ENOTSUP;
> >>+        }
> >>+
> >>+        if (nb_sectors > s->cluster_sectors) {
> >>+            /* Technically the request can cover 2 clusters, f.e. 4k write
> >>+               at s->cluster_sectors - 2k offset. One of these cluster can
> >>+               be zeroed, one unallocated */
> >This cannot happen. bdrv_co_write_zeroes() splits requests not only
> >based on the length, but so that they are actually aligned at cluster
> >boundaries.
> This do happens.
> 
> hades ~/src/qemu $ ./qemu-io -c "write -z 62k 4k" test.qcow2
> qcow2_co_write_zeroes off=0xf800 bytes nr=0x1000 bytes
> wrote 4096/4096 bytes at offset 63488
> 4 KiB, 1 ops; 0.0670 sec (59.662 KiB/sec and 14.9156 ops/sec)
> hades ~/src/qemu $ ./qemu-img info test.qcow2
> image: test.qcow2
> file format: qcow2
> virtual size: 64G (68719476736 bytes)
> disk size: 260K
> cluster_size: 65536
> Format specific information:
>     compat: 1.1
>     lazy refcounts: false
>     refcount bits: 16
>     corrupt: false
> hades ~/src/qemu $
> 
> Does this mean that I should check upper layer and fix?

Hm, I see:

    if (bs->bl.write_zeroes_alignment
        && num > bs->bl.write_zeroes_alignment) {

Removing the second part should fix this, i.e. it would split a request
into two unaligned halves even if there is no aligned "bulk" in the
middle.

I think it would match my expectations better, but maybe that's just me.
What do you think?
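
A sketch of that change against the code quoted above (untested, it only
drops the second half of the condition in bdrv_co_write_zeroes()):

    -    if (bs->bl.write_zeroes_alignment
    -        && num > bs->bl.write_zeroes_alignment) {
    +    if (bs->bl.write_zeroes_alignment) {

With this, a request shorter than one aligned chunk would still be split at
the cluster boundary, so the qcow2 driver would only ever see misaligned
fragments that fit within a single cluster.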

> >>+            cl_end = sector_num + nb_sectors - s->cluster_sectors;
> >>+            if (!is_zero_cluster(bs, cl_end)) {
> >>+                return -ENOTSUP;
> >>+            }
> >>+        }
> >>+
> >>+        qemu_co_mutex_lock(&s->lock);
> >>+        /* We can have new write after previous check */
> >>+        if (!is_zero_cluster_top_locked(bs, sector_num) ||
> >>+                (cl_end > 0 && !is_zero_cluster_top_locked(bs, cl_end))) {
> >>+            qemu_co_mutex_unlock(&s->lock);
> >>+            return -ENOTSUP;
> >>+        }
> >Just lock the mutex before the check, the possible optimisation for the
> >emulation case (which is slow anyway) isn't worth the additional code
> >complexity.
> 
> bdrv_get_block_status_above(bs) takes s->lock inside. This lock is not
> recursive thus the code will hang. This is the problem trying to be
> addressed with this split of checks.
> 
> May be we could make the lock recursive...

Maybe your version is not far from the best we can do then. It deserves a
comment, though, because it's not completely obvious.

The other option we have, and one that looks reasonable enough to me, is
to check is_zero_cluster_top_locked() first and, only if that returns
false, check the block status of the backing chain, starting at
bs->backing->bs. This way we would bypass the recursive call and could
take the lock from the beginning. If we go that way, it deserves a
comment as well.
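
A rough sketch of that alternative (untested; it reuses the helper names
from the patch and assumes backing files are not modified concurrently):
take s->lock first, look at the top layer with qcow2_get_cluster_offset(),
and only query the backing chain, starting at backing_bs(bs), when the
cluster is unallocated in the top image, so the recursive call back into
this BDS and its lock is avoided:

    /* Called with s->lock held. */
    static bool is_zero_cluster_locked(BlockDriverState *bs, int64_t start)
    {
        BDRVQcow2State *s = bs->opaque;
        int nr = s->cluster_sectors;
        uint64_t off;
        int ret;

        ret = qcow2_get_cluster_offset(bs, start << BDRV_SECTOR_BITS, &nr, &off);
        if (ret == QCOW2_CLUSTER_ZERO) {
            return true;
        }
        if (ret != QCOW2_CLUSTER_UNALLOCATED) {
            return false;
        }
        if (bs->backing == NULL) {
            /* Unallocated with no backing file reads as zeroes. */
            return true;
        }

        /* Unallocated in the top layer: the answer depends only on the
         * backing chain, so start the query below this BDS and bypass
         * qcow2_co_get_block_status() for bs itself (and its lock). */
        {
            BlockDriverState *file;
            int64_t res = bdrv_get_block_status_above(backing_bs(bs), NULL, start,
                                                      s->cluster_sectors, &nr, &file);
            return res >= 0 && (res & BDRV_BLOCK_ZERO);
        }
    }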

Kevin
Denis V. Lunev May 13, 2016, 4:09 p.m. UTC | #4
On 05/12/2016 01:37 PM, Kevin Wolf wrote:
> On 12.05.2016 at 11:00, Denis V. Lunev wrote:
>> On 05/11/2016 02:28 PM, Kevin Wolf wrote:
>>> On 11.05.2016 at 09:00, Denis V. Lunev wrote:
>>>> There is a possibility that qcow2_co_write_zeroes() will be called
>>>> with the partial block. This could be synthetically triggered with
>>>>      qemu-io -c "write -z 32k 4k"
>>>> and can happen in the real life in qemu-nbd. The latter happens under
>>>> the following conditions:
>>>>      (1) qemu-nbd is started with --detect-zeroes=on and is connected to the
>>>>          kernel NBD client
>>>>      (2) third party program opens kernel NBD device with O_DIRECT
>>>>      (3) third party program performs write operation with memory buffer
>>>>          not aligned to the page
>>>> In this case qcow2_co_write_zeroes() is unable to perform the operation
>>>> and mark entire cluster as zeroed and returns ENOTSUP. Thus the caller
>>>> switches to non-optimized version and writes real zeroes to the disk.
>>>>
>>>> The patch creates a shortcut. If the block is read as zeroes, f.e. if
>>>> it is unallocated, the request is extended to cover full block.
>>>> User-visible situation with this block is not changed. Before the patch
>>>> the block is filled in the image with real zeroes. After that patch the
>>>> block is marked as zeroed in metadata. Thus any subsequent changes in
>>>> backing store chain are not affected.
>>>>
>>>> Kevin, thank you for a cool suggestion.
>>>>
>>>> Signed-off-by: Denis V. Lunev <den@openvz.org>
>>>> Reviewed-by: Roman Kagan <rkagan@virtuozzo.com>
>>>> CC: Kevin Wolf <kwolf@redhat.com>
>>>> CC: Max Reitz <mreitz@redhat.com>
>>>> ---
>>>> Changes from v2:
>>>> - checked head/tail clusters separately (one can be zeroed, one unallocated)
>>>> - fixed range calculations
>>>> - fixed race when the block can become used just after the check
>>>> - fixed zero cluster detection
>>>> - minor tweaks in the description
>>>>
>>>> Changes from v1:
>>>> - description rewritten completely
>>>> - new approach suggested by Kevin is implemented
>>>>
>>>>   block/qcow2.c | 65 +++++++++++++++++++++++++++++++++++++++++++++++++++++------
>>>>   1 file changed, 59 insertions(+), 6 deletions(-)
Oops, the patch got committed... that is unexpected but great ;)



>>>> diff --git a/block/qcow2.c b/block/qcow2.c
>>>> index 470734b..c2474c1 100644
>>>> --- a/block/qcow2.c
>>>> +++ b/block/qcow2.c
>>>> @@ -2411,21 +2411,74 @@ finish:
>>>>       return ret;
>>>>   }
>>>> +
>>>> +static bool is_zero_cluster(BlockDriverState *bs, int64_t start)
>>>> +{
>>>> +    BDRVQcow2State *s = bs->opaque;
>>>> +    int nr;
>>>> +    BlockDriverState *file;
>>>> +    int64_t res = bdrv_get_block_status_above(bs, NULL, start,
>>>> +                                              s->cluster_sectors, &nr, &file);
>>>> +    return res >= 0 && ((res & BDRV_BLOCK_ZERO) || !(res & BDRV_BLOCK_DATA));
>>> Why did you add the !(res & BDRV_BLOCK_DATA) condition? This means that
>>> all unallocated clusters return true, even if the backing file contains
>>> non-zero data for them.
>> this is correct. From my POW this means that this area is unallocated
>> in the entire backing chain and thus it will be read as zeroes. Thus
>> we could cover it with zeroes.
> You're right that I made a mistake, I was thinking of the non-recursive
> bdrv_get_block_status().
>
> However, I still think that we may not assume that !BDRV_BLOCK_DATA
> means zero data, even though that affects only more obscure cases. We
> have bdrv_unallocated_blocks_are_zero() to check whether the assumption
> is true. However, bdrv_co_get_block_status() already checks this
> internally and sets BDRV_BLOCK_ZERO in this case, so just checking
> BDRV_BLOCK_ZERO in qcow2 should be good.
>
> Did you find a case where you got !DATA, but not ZERO, and assuming
> zeroes was valid? If so, we may need to fix bdrv_co_get_block_status().
Actually we may have the following (artificial) case:
- assume the backing store does not have bdrv_has_zero_init
- and a qcow2 image on top of this file
- reading from an unallocated block should return 0 (there is no data in
  either place), and the qcow2 layer will return 0

It looks like we will end up in this situation.

[skipped]

> Hm, I see:
>
>      if (bs->bl.write_zeroes_alignment
>          && num > bs->bl.write_zeroes_alignment) {
>
> Removing the second part should fix this, i.e. it would split a request
> into two unaligned halves even if there is no aligned "bulk" in the
> middle.
>
> I think it would match my expectations better, but maybe that's just me.
> What do you think?
Actually the code here will not be significantly better (I presume),
but I'll give it a try.



>>>> +            cl_end = sector_num + nb_sectors - s->cluster_sectors;
>>>> +            if (!is_zero_cluster(bs, cl_end)) {
>>>> +                return -ENOTSUP;
>>>> +            }
>>>> +        }
>>>> +
>>>> +        qemu_co_mutex_lock(&s->lock);
>>>> +        /* We can have new write after previous check */
>>>> +        if (!is_zero_cluster_top_locked(bs, sector_num) ||
>>>> +                (cl_end > 0 && !is_zero_cluster_top_locked(bs, cl_end))) {
>>>> +            qemu_co_mutex_unlock(&s->lock);
>>>> +            return -ENOTSUP;
>>>> +        }
>>> Just lock the mutex before the check, the possible optimisation for the
>>> emulation case (which is slow anyway) isn't worth the additional code
>>> complexity.
>> bdrv_get_block_status_above(bs) takes s->lock inside. This lock is not
>> recursive thus the code will hang. This is the problem trying to be
>> addressed with this split of checks.
>>
>> May be we could make the lock recursive...
> Maybe your version is no far from the best we can do then. It deserves a
> comment, though, because it's not completely obvious.
>
> The other option that we have and that looks reasonable enough to me is
> checking is_zero_cluster_top_locked() first and only if that returns
> false, we check the block status of the backing chain, starting at
> bs->backing->bs. This way we would bypass the recursive call and could
> take the lock from the beginning. If we go that way, it deserves a
> comment as well.
>
> Kevin
OK. I'll send at least improved comments and (maybe) the removal of
"&& num > bs->bl.write_zeroes_alignment" as a follow-up.

Thank you for the ideas ;)

Den
Kevin Wolf May 13, 2016, 4:24 p.m. UTC | #5
On 13.05.2016 at 18:09, Denis V. Lunev wrote:
> oops, the patch gets committed... that is unexpected but great ;)

Oops, that was not intended, I just forgot to remove it from the queue
when I realised that it wasn't ready yet.

Should I stage a revert, or would you prefer to fix it on top?

> >>>>diff --git a/block/qcow2.c b/block/qcow2.c
> >>>>index 470734b..c2474c1 100644
> >>>>--- a/block/qcow2.c
> >>>>+++ b/block/qcow2.c
> >>>>@@ -2411,21 +2411,74 @@ finish:
> >>>>      return ret;
> >>>>  }
> >>>>+
> >>>>+static bool is_zero_cluster(BlockDriverState *bs, int64_t start)
> >>>>+{
> >>>>+    BDRVQcow2State *s = bs->opaque;
> >>>>+    int nr;
> >>>>+    BlockDriverState *file;
> >>>>+    int64_t res = bdrv_get_block_status_above(bs, NULL, start,
> >>>>+                                              s->cluster_sectors, &nr, &file);
> >>>>+    return res >= 0 && ((res & BDRV_BLOCK_ZERO) || !(res & BDRV_BLOCK_DATA));
> >>>Why did you add the !(res & BDRV_BLOCK_DATA) condition? This means that
> >>>all unallocated clusters return true, even if the backing file contains
> >>>non-zero data for them.
> >>this is correct. From my POW this means that this area is unallocated
> >>in the entire backing chain and thus it will be read as zeroes. Thus
> >>we could cover it with zeroes.
> >You're right that I made a mistake, I was thinking of the non-recursive
> >bdrv_get_block_status().
> >
> >However, I still think that we may not assume that !BDRV_BLOCK_DATA
> >means zero data, even though that affects only more obscure cases. We
> >have bdrv_unallocated_blocks_are_zero() to check whether the assumption
> >is true. However, bdrv_co_get_block_status() already checks this
> >internally and sets BDRV_BLOCK_ZERO in this case, so just checking
> >BDRV_BLOCK_ZERO in qcow2 should be good.
> >
> >Did you find a case where you got !DATA, but not ZERO, and assuming
> >zeroes was valid? If so, we may need to fix bdrv_co_get_block_status().
> actually we may have the following case (artificial)!:
> - assuming we do not have bdrv_has_zero_init in backing store
> - and qcow2 on top of this file
> - reading from unallocated block should return 0 (no data in both
> places), qcow2
>   layer will return 0
> 
> It looks like we will have this situation.

qcow2 sets bdi->unallocated_blocks_are_zero = true, so in this case you
should correctly get BDRV_BLOCK_ZERO from bdrv_get_block_status().

> >Hm, I see:
> >
> >     if (bs->bl.write_zeroes_alignment
> >         && num > bs->bl.write_zeroes_alignment) {
> >
> >Removing the second part should fix this, i.e. it would split a request
> >into two unaligned halves even if there is no aligned "bulk" in the
> >middle.
> >
> >I think it would match my expectations better, but maybe that's just me.
> >What do you think?
> actually the code here will not be significantly better (I presume),
> but I'll make a try

Yes, I agree that it won't make the qcow2 code significantly simpler. I
just think that it would be less surprising semantics.

> >>>>+            cl_end = sector_num + nb_sectors - s->cluster_sectors;
> >>>>+            if (!is_zero_cluster(bs, cl_end)) {
> >>>>+                return -ENOTSUP;
> >>>>+            }
> >>>>+        }
> >>>>+
> >>>>+        qemu_co_mutex_lock(&s->lock);
> >>>>+        /* We can have new write after previous check */
> >>>>+        if (!is_zero_cluster_top_locked(bs, sector_num) ||
> >>>>+                (cl_end > 0 && !is_zero_cluster_top_locked(bs, cl_end))) {
> >>>>+            qemu_co_mutex_unlock(&s->lock);
> >>>>+            return -ENOTSUP;
> >>>>+        }
> >>>Just lock the mutex before the check, the possible optimisation for the
> >>>emulation case (which is slow anyway) isn't worth the additional code
> >>>complexity.
> >>bdrv_get_block_status_above(bs) takes s->lock inside. This lock is not
> >>recursive thus the code will hang. This is the problem trying to be
> >>addressed with this split of checks.
> >>
> >>May be we could make the lock recursive...
> >Maybe your version is no far from the best we can do then. It deserves a
> >comment, though, because it's not completely obvious.
> >
> >The other option that we have and that looks reasonable enough to me is
> >checking is_zero_cluster_top_locked() first and only if that returns
> >false, we check the block status of the backing chain, starting at
> >bs->backing->bs. This way we would bypass the recursive call and could
> >take the lock from the beginning. If we go that way, it deserves a
> >comment as well.
> >
> >Kevin
> OK. I'll send at least improved comments and (may be)
> removal of "&& num > bs->bl.write_zeroes_alignment"
> as follow up.

The most important part is actually fixing is_zero_cluster(), because
that's a real bug that can corrupt data in the !has_zero_init case. This
is also the reason why I would revert the patch if we don't have a fix
for this by my next pull request.

The rest is just making things a bit nicer, so follow-ups are very
welcome, but not as critical.

Kevin
Denis V. Lunev May 13, 2016, 4:37 p.m. UTC | #6
On 05/13/2016 07:24 PM, Kevin Wolf wrote:
> On 13.05.2016 at 18:09, Denis V. Lunev wrote:
>> oops, the patch gets committed... that is unexpected but great ;)
> Oops, that was not intended, I just forgot to remove it from the queue
> when I realised that it's not ready yet.
>
> Should I stage a revert or do you prefer to fix it on top?
I'd better do this on top. You will have the fix tomorrow.


>>>>>> diff --git a/block/qcow2.c b/block/qcow2.c
>>>>>> index 470734b..c2474c1 100644
>>>>>> --- a/block/qcow2.c
>>>>>> +++ b/block/qcow2.c
>>>>>> @@ -2411,21 +2411,74 @@ finish:
>>>>>>       return ret;
>>>>>>   }
>>>>>> +
>>>>>> +static bool is_zero_cluster(BlockDriverState *bs, int64_t start)
>>>>>> +{
>>>>>> +    BDRVQcow2State *s = bs->opaque;
>>>>>> +    int nr;
>>>>>> +    BlockDriverState *file;
>>>>>> +    int64_t res = bdrv_get_block_status_above(bs, NULL, start,
>>>>>> +                                              s->cluster_sectors, &nr, &file);
>>>>>> +    return res >= 0 && ((res & BDRV_BLOCK_ZERO) || !(res & BDRV_BLOCK_DATA));
>>>>> Why did you add the !(res & BDRV_BLOCK_DATA) condition? This means that
>>>>> all unallocated clusters return true, even if the backing file contains
>>>>> non-zero data for them.
>>>> this is correct. From my POW this means that this area is unallocated
>>>> in the entire backing chain and thus it will be read as zeroes. Thus
>>>> we could cover it with zeroes.
>>> You're right that I made a mistake, I was thinking of the non-recursive
>>> bdrv_get_block_status().
>>>
>>> However, I still think that we may not assume that !BDRV_BLOCK_DATA
>>> means zero data, even though that affects only more obscure cases. We
>>> have bdrv_unallocated_blocks_are_zero() to check whether the assumption
>>> is true. However, bdrv_co_get_block_status() already checks this
>>> internally and sets BDRV_BLOCK_ZERO in this case, so just checking
>>> BDRV_BLOCK_ZERO in qcow2 should be good.
>>>
>>> Did you find a case where you got !DATA, but not ZERO, and assuming
>>> zeroes was valid? If so, we may need to fix bdrv_co_get_block_status().
>> actually we may have the following case (artificial)!:
>> - assuming we do not have bdrv_has_zero_init in backing store
>> - and qcow2 on top of this file
>> - reading from unallocated block should return 0 (no data in both
>> places), qcow2
>>    layer will return 0
>>
>> It looks like we will have this situation.
> qcow2 sets bdi->unallocated_blocks_are_zero = true, so in this case you
> should correctly get BDRV_BLOCK_ZERO from bdrv_get_block_status().
I will check that.

>>> Hm, I see:
>>>
>>>      if (bs->bl.write_zeroes_alignment
>>>          && num > bs->bl.write_zeroes_alignment) {
>>>
>>> Removing the second part should fix this, i.e. it would split a request
>>> into two unaligned halves even if there is no aligned "bulk" in the
>>> middle.
>>>
>>> I think it would match my expectations better, but maybe that's just me.
>>> What do you think?
>> actually the code here will not be significantly better (I presume),
>> but I'll make a try
> Yes, I agree that it won't make the qcow2 code significantly simpler. I
> just think that it would be less surprising semantics.
>
>>>>>> +            cl_end = sector_num + nb_sectors - s->cluster_sectors;
>>>>>> +            if (!is_zero_cluster(bs, cl_end)) {
>>>>>> +                return -ENOTSUP;
>>>>>> +            }
>>>>>> +        }
>>>>>> +
>>>>>> +        qemu_co_mutex_lock(&s->lock);
>>>>>> +        /* We can have new write after previous check */
>>>>>> +        if (!is_zero_cluster_top_locked(bs, sector_num) ||
>>>>>> +                (cl_end > 0 && !is_zero_cluster_top_locked(bs, cl_end))) {
>>>>>> +            qemu_co_mutex_unlock(&s->lock);
>>>>>> +            return -ENOTSUP;
>>>>>> +        }
>>>>> Just lock the mutex before the check, the possible optimisation for the
>>>>> emulation case (which is slow anyway) isn't worth the additional code
>>>>> complexity.
>>>> bdrv_get_block_status_above(bs) takes s->lock inside. This lock is not
>>>> recursive thus the code will hang. This is the problem trying to be
>>>> addressed with this split of checks.
>>>>
>>>> May be we could make the lock recursive...
>>> Maybe your version is no far from the best we can do then. It deserves a
>>> comment, though, because it's not completely obvious.
>>>
>>> The other option that we have and that looks reasonable enough to me is
>>> checking is_zero_cluster_top_locked() first and only if that returns
>>> false, we check the block status of the backing chain, starting at
>>> bs->backing->bs. This way we would bypass the recursive call and could
>>> take the lock from the beginning. If we go that way, it deserves a
>>> comment as well.
>>>
>>> Kevin
>> OK. I'll send at least improved comments and (may be)
>> removal of "&& num > bs->bl.write_zeroes_alignment"
>> as follow up.
> The most important part is actually fixing is_zero_cluster(), because
> that's a real bug that can corrupt data in the !has_zero_init case. This
> is also the reason why I would revert the patch if we don't have a fix
> for this until my next pull request.
>
> The rest is just making things a bit nicer, so follow-ups are very
> welcome, but not as critical.
>
> Kevin
You will have the fixup tomorrow. I don't want
to touch this so late.

Den
diff mbox

Patch

diff --git a/block/qcow2.c b/block/qcow2.c
index 470734b..c2474c1 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -2411,21 +2411,74 @@  finish:
     return ret;
 }
 
+
+static bool is_zero_cluster(BlockDriverState *bs, int64_t start)
+{
+    BDRVQcow2State *s = bs->opaque;
+    int nr;
+    BlockDriverState *file;
+    int64_t res = bdrv_get_block_status_above(bs, NULL, start,
+                                              s->cluster_sectors, &nr, &file);
+    return res >= 0 && ((res & BDRV_BLOCK_ZERO) || !(res & BDRV_BLOCK_DATA));
+}
+
+static bool is_zero_cluster_top_locked(BlockDriverState *bs, int64_t start)
+{
+    BDRVQcow2State *s = bs->opaque;
+    int nr = s->cluster_sectors;
+    uint64_t off;
+    int ret;
+
+    ret = qcow2_get_cluster_offset(bs, start << BDRV_SECTOR_BITS, &nr, &off);
+    return ret == QCOW2_CLUSTER_UNALLOCATED || ret == QCOW2_CLUSTER_ZERO;
+}
+
 static coroutine_fn int qcow2_co_write_zeroes(BlockDriverState *bs,
     int64_t sector_num, int nb_sectors, BdrvRequestFlags flags)
 {
     int ret;
     BDRVQcow2State *s = bs->opaque;
 
-    /* Emulate misaligned zero writes */
-    if (sector_num % s->cluster_sectors || nb_sectors % s->cluster_sectors) {
-        return -ENOTSUP;
+    int head = sector_num % s->cluster_sectors;
+    int tail = (sector_num + nb_sectors) % s->cluster_sectors;
+
+    if (head != 0 || tail != 0) {
+        int64_t cl_end = -1;
+
+        sector_num -= head;
+        nb_sectors += head;
+
+        if (tail != 0) {
+            nb_sectors += s->cluster_sectors - tail;
+        }
+
+        if (!is_zero_cluster(bs, sector_num)) {
+            return -ENOTSUP;
+        }
+
+        if (nb_sectors > s->cluster_sectors) {
+            /* Technically the request can cover 2 clusters, f.e. 4k write
+               at s->cluster_sectors - 2k offset. One of these cluster can
+               be zeroed, one unallocated */
+            cl_end = sector_num + nb_sectors - s->cluster_sectors;
+            if (!is_zero_cluster(bs, cl_end)) {
+                return -ENOTSUP;
+            }
+        }
+
+        qemu_co_mutex_lock(&s->lock);
+        /* We can have new write after previous check */
+        if (!is_zero_cluster_top_locked(bs, sector_num) ||
+                (cl_end > 0 && !is_zero_cluster_top_locked(bs, cl_end))) {
+            qemu_co_mutex_unlock(&s->lock);
+            return -ENOTSUP;
+        }
+    } else {
+        qemu_co_mutex_lock(&s->lock);
     }
 
     /* Whatever is left can use real zero clusters */
-    qemu_co_mutex_lock(&s->lock);
-    ret = qcow2_zero_clusters(bs, sector_num << BDRV_SECTOR_BITS,
-        nb_sectors);
+    ret = qcow2_zero_clusters(bs, sector_num << BDRV_SECTOR_BITS, nb_sectors);
     qemu_co_mutex_unlock(&s->lock);
 
     return ret;