diff mbox

[v10,09/17] qcow2: Optimize write zero of unaligned tail cluster

Message ID 20170427014626.11553-10-eblake@redhat.com (mailing list archive)
State New, archived
Headers show

Commit Message

Eric Blake April 27, 2017, 1:46 a.m. UTC
We've already improved discards to operate efficiently on the tail
of an unaligned qcow2 image; it's time to make a similar improvement
to write zeroes.  The special case is only valid at the tail
cluster of a file, where we must recognize that any sectors beyond
the image end would implicitly read as zero, and therefore should
not penalize our logic for widening a partial cluster into writing
the whole cluster as zero.

Update test 154 to cover the new scenarios, using two images of
intentionally differing length.

While at it, fix the test to gracefully skip when run as
./check -qcow2 -o compat=0.10 154
since the older format lacks zero clusters already required earlier
in the test.

Signed-off-by: Eric Blake <eblake@redhat.com>

---
v10: rebase to better reporting of preallocated zero clusters
v9: new patch
---
 block/qcow2.c              |   7 +++
 tests/qemu-iotests/154     | 112 ++++++++++++++++++++++++++++++++++++++++++++-
 tests/qemu-iotests/154.out |  83 +++++++++++++++++++++++++++++++++
 3 files changed, 200 insertions(+), 2 deletions(-)

Comments

Max Reitz April 28, 2017, 8:48 p.m. UTC | #1
On 27.04.2017 03:46, Eric Blake wrote:
> We've already improved discards to operate efficiently on the tail
> of an unaligned qcow2 image; it's time to make a similar improvement
> to write zeroes.  The special case is only valid at the tail
> cluster of a file, where we must recognize that any sectors beyond
> the image end would implicitly read as zero, and therefore should
> not penalize our logic for widening a partial cluster into writing
> the whole cluster as zero.
> 
> Update test 154 to cover the new scenarios, using two images of
> intentionally differing length.
> 
> While at it, fix the test to gracefully skip when run as
> ./check -qcow2 -o compat=0.10 154
> since the older format lacks zero clusters already required earlier
> in the test.
> 
> Signed-off-by: Eric Blake <eblake@redhat.com>
> 
> ---
> v10: rebase to better reporting of preallocated zero clusters
> v9: new patch
> ---
>  block/qcow2.c              |   7 +++
>  tests/qemu-iotests/154     | 112 ++++++++++++++++++++++++++++++++++++++++++++-
>  tests/qemu-iotests/154.out |  83 +++++++++++++++++++++++++++++++++
>  3 files changed, 200 insertions(+), 2 deletions(-)
> 
> diff --git a/block/qcow2.c b/block/qcow2.c
> index 5c1573c..8038793 100644
> --- a/block/qcow2.c
> +++ b/block/qcow2.c
> @@ -2449,6 +2449,10 @@ static bool is_zero_sectors(BlockDriverState *bs, int64_t start,
>      BlockDriverState *file;
>      int64_t res;
> 
> +    if (start + count > bs->total_sectors) {
> +        count = bs->total_sectors - start;
> +    }
> +
>      if (!count) {
>          return true;
>      }
> @@ -2467,6 +2471,9 @@ static coroutine_fn int qcow2_co_pwrite_zeroes(BlockDriverState *bs,
>      uint32_t tail = (offset + count) % s->cluster_size;
> 
>      trace_qcow2_pwrite_zeroes_start_req(qemu_coroutine_self(), offset, count);
> +    if (offset + count == bs->total_sectors * BDRV_SECTOR_SIZE) {
> +        tail = 0;
> +    }
> 
>      if (head || tail) {
>          int64_t cl_start = (offset - head) >> BDRV_SECTOR_BITS;
> diff --git a/tests/qemu-iotests/154 b/tests/qemu-iotests/154
> index 7ca7219..005f09f 100755
> --- a/tests/qemu-iotests/154
> +++ b/tests/qemu-iotests/154
> @@ -2,7 +2,7 @@
>  #
>  # qcow2 specific bdrv_pwrite_zeroes tests with backing files (complements 034)
>  #
> -# Copyright (C) 2016 Red Hat, Inc.
> +# Copyright (C) 2016-2017 Red Hat, Inc.
>  #
>  # This program is free software; you can redistribute it and/or modify
>  # it under the terms of the GNU General Public License as published by
> @@ -42,7 +42,10 @@ _supported_proto file
>  _supported_os Linux
> 
>  CLUSTER_SIZE=4k
> -size=128M
> +size=$((128 * 1024 * 1024))
> +
> +# This test requires zero clusters, added in v3 images
> +_unsupported_imgopts compat=0.10
> 
>  echo
>  echo == backing file contains zeros ==
> @@ -299,6 +302,111 @@ $QEMU_IO -c "read -P 0 75k 1k" "$TEST_IMG" | _filter_qemu_io
> 
>  $QEMU_IMG map --output=json "$TEST_IMG" | _filter_qemu_img_map
> 
> +echo
> +echo == unaligned image tail cluster, no allocation needed ==
> +
> +CLUSTER_SIZE=1024 TEST_IMG="$TEST_IMG.base" _make_test_img $((size + 1024))

Any reason for the CLUSTER_SIZE? It passes with 64 kB as well, and I
actually think that would be a better test because as it is, the image's
"unaligned" tail is exactly one cluster (so it isn't really unaligned).

> +_make_test_img -b "$TEST_IMG.base" $((size + 2048))
> +
> +# With no backing file, write to all or part of unallocated partial cluster
> +
> +# Backing file: 128m: ... | 00 --

Not sure what the "00 --" is supposed to mean. The cluster size is 1 kB,
so this is just a single cluster; which will be converted from
unallocated to a zero cluster because the tail reads as zero.

> +$QEMU_IO -c "write -z $size 512" "$TEST_IMG.base" | _filter_qemu_io
> +
> +# Backing file: 128m: ... | -- 00

Same here, but now it's the head that reads as zero (and it's already a
zero cluster so nothing changes here).

> +$QEMU_IO -c "write -z $size 512" "$TEST_IMG.base" | _filter_qemu_io

But I suppose you mean $((size + 512))?

> +
> +# Backing file: 128m: ... | 00 00

And finally this doesn't change anything either -- but then again this
is a fully aligned request, so I don't quite see the point in testing
this here.

> +$QEMU_IO -c "write -z $size 1024" "$TEST_IMG.base" | _filter_qemu_io
> +
> +# No offset should be allocated, although the cluster itself is considered
> +# allocated due to the zero flag
> +$QEMU_IO -c "alloc $size 1024" "$TEST_IMG.base" | _filter_qemu_io
> +$QEMU_IMG map --output=json "$TEST_IMG.base" | _filter_qemu_img_map
> +
> +# With backing file that reads as zero, write to all or part of entire cluster
> +
> +# Backing file: 128m: ... | 00 00
> +# Active layer: 128m: ... | 00 00 00 00
> +$QEMU_IO -c "write -z $size 2048" "$TEST_IMG" | _filter_qemu_io
> +$QEMU_IMG map --output=json "$TEST_IMG" | _filter_qemu_img_map
> +
> +# Backing file: 128m: ... | 00 00
> +# Active layer: 128m: ... | 00 00 -- --

How so? Shouldn't this just stay a zero cluster because the rest of the
cluster already reads as zero?

> +$QEMU_IO -c "write -z $((size)) 1024" "$TEST_IMG" | _filter_qemu_io
> +
> +# Backing file: 128m: ... | 00 00
> +# Active layer: 128m: ... | -- -- 00 00

Same here.

> +$QEMU_IO -c "write -z $((size + 1024)) 1024" "$TEST_IMG" | _filter_qemu_io
> +
> +# Backing file: 128m: ... | 00 00
> +# Active layer: 128m: ... | -- 00 00 --

Same here.

Also, I'm not quite sure what you are testing by just issuing three
consecutive write requests without checking the result each time. You
assume it to be something which you wrote in the comments above each,
but you can't be sure (yes, if something goes wrong in between, the
following final map should catch it, but it's not for certain).

> +$QEMU_IO -c "write -z $((size + 512)) 1024" "$TEST_IMG" | _filter_qemu_io
> +
> +# No offset should be allocated, although the cluster itself is considered
> +# allocated due to the zero flag
> +$QEMU_IO -c "alloc $size 2048" "$TEST_IMG" | _filter_qemu_io
> +$QEMU_IMG map --output=json "$TEST_IMG" | _filter_qemu_img_map
> +
> +# Preallocated cluster is not freed when unmap is not requested
> +
> +# Backing file: 128m: ... | XX -- => ... | XX 00
> +$QEMU_IO -c "write $((size)) 512" "$TEST_IMG.base" | _filter_qemu_io
> +$QEMU_IO -c "write -z $((size + 512)) 512" "$TEST_IMG.base" | _filter_qemu_io
> +
> +# Backing file: 128m: ... | XX 00 => ... | 00 00
> +$QEMU_IO -c "write -z $((size)) 1024" "$TEST_IMG.base" | _filter_qemu_io
> +$QEMU_IMG map --output=json "$TEST_IMG.base" | _filter_qemu_img_map |
> +    sed 's/"offset": [0-9]*/"offset": OFFSET/g'
> +
> +echo
> +echo == unaligned image tail cluster, allocation required ==
> +
> +CLUSTER_SIZE=512 TEST_IMG="$TEST_IMG.base" _make_test_img $((size + 1024))
> +_make_test_img -b "$TEST_IMG.base" $((size + 2048))
> +
> +# Write beyond backing file must COW
> +# Backing file: 128m: ... | XX --

Here, the "--" is an unallocated cluster...

> +# Active layer: 128m: ... | -- -- 00 --

...while here it is normally allocated data. Or is "--" supposed to mean
you just don't write to that area...?

> +
> +$QEMU_IO -c "write -P 1 $((size)) 512" "$TEST_IMG.base" | _filter_qemu_io
> +$QEMU_IO -c "write -z $((size + 1024)) 512" "$TEST_IMG" | _filter_qemu_io
> +$QEMU_IO -c "read -P 1 $((size)) 512" "$TEST_IMG" | _filter_qemu_io
> +$QEMU_IO -c "read -P 0 $((size + 512)) 1536" "$TEST_IMG" | _filter_qemu_io
> +$QEMU_IMG map --output=json "$TEST_IMG" | _filter_qemu_img_map
> +
> +CLUSTER_SIZE=512 TEST_IMG="$TEST_IMG.base" _make_test_img $((size + 1024))
> +_make_test_img -b "$TEST_IMG.base" $((size + 2048))
> +
> +# Writes at boundaries of (partial) cluster must not lose mid-cluster data
> +# Backing file: 128m: ... | -- XX
> +# Active layer: 128m: ... | 00 -- -- 00
> +
> +$QEMU_IO -c "write -P 1 $((size + 512)) 512" "$TEST_IMG.base" | _filter_qemu_io
> +$QEMU_IO -c "write -z $((size)) 512" "$TEST_IMG" | _filter_qemu_io
> +$QEMU_IO -c "read -P 0 $((size)) 512" "$TEST_IMG" | _filter_qemu_io
> +$QEMU_IO -c "read -P 1 $((size + 512)) 512" "$TEST_IMG" | _filter_qemu_io
> +$QEMU_IO -c "read -P 0 $((size + 1024)) 1024" "$TEST_IMG" | _filter_qemu_> +$QEMU_IO -c "write -z $((size + 1536)) 512" "$TEST_IMG" | _filter_qemu_io

[1]

> +$QEMU_IO -c "read -P 0 $((size)) 512" "$TEST_IMG" | _filter_qemu_io
> +$QEMU_IO -c "read -P 1 $((size + 512)) 512" "$TEST_IMG" | _filter_qemu_io
> +$QEMU_IO -c "read -P 0 $((size + 1024)) 1024" "$TEST_IMG" | _filter_qemu_io
> +$QEMU_IMG map --output=json "$TEST_IMG" | _filter_qemu_img_map
> +
> +CLUSTER_SIZE=512 TEST_IMG="$TEST_IMG.base" _make_test_img $((size + 1024))
> +_make_test_img -b "$TEST_IMG.base" $((size + 2048))
> +
> +# Partial write must not lose data
> +# Backing file: 128m: ... | -- --
> +# Active layer: 128m: ... | -- -- XX 00

Well, you have basically tested this in [1] already. After the first
write -z to the active layer, the final four sectors are fully allocated
(as they are in a single cluster) and look like this:

00 01 00 00 = XX XX XX XX

(Because the 00s are just written as data.)

Now the write -z in [1] just overwrites the last of those four sectors
with zero data (albeit it being zero already).

But if you want to test it again, I'm not stopping you. :-)

Max

> +
> +$QEMU_IO -c "write -P 1 $((size + 1024)) 512" "$TEST_IMG" | _filter_qemu_io
> +$QEMU_IO -c "write -z $((size + 1536)) 512" "$TEST_IMG" | _filter_qemu_io
> +$QEMU_IO -c "read -P 0 $((size)) 1024" "$TEST_IMG" | _filter_qemu_io
> +$QEMU_IO -c "read -P 1 $((size + 1024)) 512" "$TEST_IMG" | _filter_qemu_io
> +$QEMU_IO -c "read -P 0 $((size + 1536)) 512" "$TEST_IMG" | _filter_qemu_io
> +$QEMU_IMG map --output=json "$TEST_IMG" | _filter_qemu_img_map
> +
>  # success, all done
>  echo "*** done"
>  rm -f $seq.full
> diff --git a/tests/qemu-iotests/154.out b/tests/qemu-iotests/154.out
> index da9eabd..f8a09af 100644
> --- a/tests/qemu-iotests/154.out
> +++ b/tests/qemu-iotests/154.out
> @@ -282,4 +282,87 @@ read 1024/1024 bytes at offset 76800
>  { "start": 69632, "length": 4096, "depth": 0, "zero": true, "data": false},
>  { "start": 73728, "length": 4096, "depth": 0, "zero": false, "data": true, "offset": 24576},
>  { "start": 77824, "length": 134139904, "depth": 1, "zero": true, "data": false}]
> +
> +== unaligned image tail cluster, no allocation needed ==
> +Formatting 'TEST_DIR/t.IMGFMT.base', fmt=IMGFMT size=134218752
> +Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=134219776 backing_file=TEST_DIR/t.IMGFMT.base
> +wrote 512/512 bytes at offset 134217728
> +512 bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> +wrote 512/512 bytes at offset 134217728
> +512 bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> +wrote 1024/1024 bytes at offset 134217728
> +1 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> +1024/1024 bytes allocated at offset 128 MiB
> +[{ "start": 0, "length": 134218752, "depth": 0, "zero": true, "data": false}]
> +wrote 2048/2048 bytes at offset 134217728
> +2 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> +[{ "start": 0, "length": 134217728, "depth": 1, "zero": true, "data": false},
> +{ "start": 134217728, "length": 2048, "depth": 0, "zero": true, "data": false}]
> +wrote 1024/1024 bytes at offset 134217728
> +1 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> +wrote 1024/1024 bytes at offset 134218752
> +1 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> +wrote 1024/1024 bytes at offset 134218240
> +1 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> +2048/2048 bytes allocated at offset 128 MiB
> +[{ "start": 0, "length": 134217728, "depth": 1, "zero": true, "data": false},
> +{ "start": 134217728, "length": 2048, "depth": 0, "zero": true, "data": false}]
> +wrote 512/512 bytes at offset 134217728
> +512 bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> +wrote 512/512 bytes at offset 134218240
> +512 bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> +wrote 1024/1024 bytes at offset 134217728
> +1 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> +[{ "start": 0, "length": 134217728, "depth": 0, "zero": true, "data": false},
> +{ "start": 134217728, "length": 1024, "depth": 0, "zero": true, "data": false, "offset": OFFSET}]
> +
> +== unaligned image tail cluster, allocation required ==
> +Formatting 'TEST_DIR/t.IMGFMT.base', fmt=IMGFMT size=134218752
> +Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=134219776 backing_file=TEST_DIR/t.IMGFMT.base
> +wrote 512/512 bytes at offset 134217728
> +512 bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> +wrote 512/512 bytes at offset 134218752
> +512 bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> +read 512/512 bytes at offset 134217728
> +512 bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> +read 1536/1536 bytes at offset 134218240
> +1.500 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> +[{ "start": 0, "length": 134217728, "depth": 1, "zero": true, "data": false},
> +{ "start": 134217728, "length": 2048, "depth": 0, "zero": false, "data": true, "offset": 20480}]
> +Formatting 'TEST_DIR/t.IMGFMT.base', fmt=IMGFMT size=134218752
> +Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=134219776 backing_file=TEST_DIR/t.IMGFMT.base
> +wrote 512/512 bytes at offset 134218240
> +512 bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> +wrote 512/512 bytes at offset 134217728
> +512 bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> +read 512/512 bytes at offset 134217728
> +512 bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> +read 512/512 bytes at offset 134218240
> +512 bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> +read 1024/1024 bytes at offset 134218752
> +1 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> +wrote 512/512 bytes at offset 134219264
> +512 bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> +read 512/512 bytes at offset 134217728
> +512 bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> +read 512/512 bytes at offset 134218240
> +512 bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> +read 1024/1024 bytes at offset 134218752
> +1 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> +[{ "start": 0, "length": 134217728, "depth": 1, "zero": true, "data": false},
> +{ "start": 134217728, "length": 2048, "depth": 0, "zero": false, "data": true, "offset": 20480}]
> +Formatting 'TEST_DIR/t.IMGFMT.base', fmt=IMGFMT size=134218752
> +Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=134219776 backing_file=TEST_DIR/t.IMGFMT.base
> +wrote 512/512 bytes at offset 134218752
> +512 bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> +wrote 512/512 bytes at offset 134219264
> +512 bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> +read 1024/1024 bytes at offset 134217728
> +1 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> +read 512/512 bytes at offset 134218752
> +512 bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> +read 512/512 bytes at offset 134219264
> +512 bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> +[{ "start": 0, "length": 134217728, "depth": 1, "zero": true, "data": false},
> +{ "start": 134217728, "length": 2048, "depth": 0, "zero": false, "data": true, "offset": 20480}]
>  *** done
>
Eric Blake April 28, 2017, 9:24 p.m. UTC | #2
On 04/28/2017 03:48 PM, Max Reitz wrote:
> On 27.04.2017 03:46, Eric Blake wrote:
>> We've already improved discards to operate efficiently on the tail
>> of an unaligned qcow2 image; it's time to make a similar improvement
>> to write zeroes.  The special case is only valid at the tail
>> cluster of a file, where we must recognize that any sectors beyond
>> the image end would implicitly read as zero, and therefore should
>> not penalize our logic for widening a partial cluster into writing
>> the whole cluster as zero.
>>
>> Update test 154 to cover the new scenarios, using two images of
>> intentionally differing length.
>>
>> While at it, fix the test to gracefully skip when run as
>> ./check -qcow2 -o compat=0.10 154
>> since the older format lacks zero clusters already required earlier
>> in the test.
>>
>> Signed-off-by: Eric Blake <eblake@redhat.com>
>>

>>
>> +echo
>> +echo == unaligned image tail cluster, no allocation needed ==
>> +
>> +CLUSTER_SIZE=1024 TEST_IMG="$TEST_IMG.base" _make_test_img $((size + 1024))
> 
> Any reason for the CLUSTER_SIZE? It passes with 64 kB as well, and I
> actually think that would be a better test because as it is, the image's
> "unaligned" tail is exactly one cluster (so it isn't really unaligned).

Uhhh - copy-and-paste?  You're right, that 1024 is too small for what
I'm actually doing with it.  :(

> 
>> +_make_test_img -b "$TEST_IMG.base" $((size + 2048))
>> +
>> +# With no backing file, write to all or part of unallocated partial cluster
>> +
>> +# Backing file: 128m: ... | 00 --
> 
> Not sure what the "00 --" is supposed to mean. The cluster size is 1 kB,
> so this is just a single cluster; which will be converted from
> unallocated to a zero cluster because the tail reads as zero.
> 
>> +$QEMU_IO -c "write -z $size 512" "$TEST_IMG.base" | _filter_qemu_io

I'm doing a write of 512 bytes, so I was trying to show that in the
cluster starting at 128m, we have a write request for one sector to be
written to 00, while the rest of the sector is left uninitialized. The
obvious intent is that we note that the uninitialized portion reads as
zero, so we can optimize the entire cluster (even though it is a partial
cluster) to be a cluster write-zero instead of a sector-based allocating
write.  But since you've pointed out that my setup is flawed, I'll have
to double-check that I'm actually testing what I thought I was (I know I
hit the right breakpoints in gdb, but its the iotest expected output
that needs to show the difference between pre- and post-patched qemu
with regards to the partial-tail-cluster optimizations).

>> +
>> +# Backing file: 128m: ... | -- 00
> 
> Same here, but now it's the head that reads as zero (and it's already a
> zero cluster so nothing changes here).
> 
>> +$QEMU_IO -c "write -z $size 512" "$TEST_IMG.base" | _filter_qemu_io
> 
> But I suppose you mean $((size + 512))?

Umm. Yeah.  (Hey - you're not the only one that writes patches late at
night).

> 
>> +
>> +# Backing file: 128m: ... | 00 00
> 
> And finally this doesn't change anything either -- but then again this
> is a fully aligned request, so I don't quite see the point in testing
> this here.

At one point during my development, I had a bug where a partial write
request at the head of the cluster behaved differently from a partial
write request at the end (one allocated while the other did not).

Maybe it's best if I do a single write, then inspect the map, then reset
the image, rather than doing all three writes in a row and proving that
the net result of the three writes had no allocation.  In v9, when I was
trying to use unallocated clusters as a shortcut on files with no
backing image, this actually worked (after all three writes, the cluster
would still be unallocated); but since Kevin convinced me that v10 has
to set the zero flag on all three writes, I'm back to the point of
needing to clear the zero flag between writes.

Okay, I'll definitely have to fix it.

>> +# With backing file that reads as zero, write to all or part of entire cluster
>> +
>> +# Backing file: 128m: ... | 00 00
>> +# Active layer: 128m: ... | 00 00 00 00
>> +$QEMU_IO -c "write -z $size 2048" "$TEST_IMG" | _filter_qemu_io
>> +$QEMU_IMG map --output=json "$TEST_IMG" | _filter_qemu_img_map
>> +
>> +# Backing file: 128m: ... | 00 00
>> +# Active layer: 128m: ... | 00 00 -- --
> 
> How so? Shouldn't this just stay a zero cluster because the rest of the
> cluster already reads as zero?

Same problem as above.  In v9, I was trying to exploit code that left a
cluster unallocated; but that's not viable, so I'll have to reset the
image between steps.

> 
> Also, I'm not quite sure what you are testing by just issuing three
> consecutive write requests without checking the result each time. You
> assume it to be something which you wrote in the comments above each,
> but you can't be sure (yes, if something goes wrong in between, the
> following final map should catch it, but it's not for certain).

Well, it helped me during development, but you're absolutely right that
we can't be certain in the long-run.  3 separate checks, with resets
between, is indeed the way to go.


>> +echo
>> +echo == unaligned image tail cluster, allocation required ==
>> +
>> +CLUSTER_SIZE=512 TEST_IMG="$TEST_IMG.base" _make_test_img $((size + 1024))

Here, my backing cluster size is intentionally small, so that
bdrv_get_block_status() can easily return sane results for the backing
file portions, so only the active layer has a partial cluster. But
making it obvious in the comments can't hurt, given that my earlier
tests were invalidated when my size choice didn't actually match with
the operations I was attempting.

>> +_make_test_img -b "$TEST_IMG.base" $((size + 2048))
>> +
>> +# Write beyond backing file must COW
>> +# Backing file: 128m: ... | XX --
> 
> Here, the "--" is an unallocated cluster...
> 
>> +# Active layer: 128m: ... | -- -- 00 --
> 
> ...while here it is normally allocated data. Or is "--" supposed to mean
> you just don't write to that area...?

If we had subclusters, then it would be unallocated data.  But we don't.
 Since I just reset the image, I'm doing a write onto an unallocated
cluster in the current layer; the write request is to a subset of the
cluster, and the code checking whether the rest of the cluster reads as
zeros will see that a COW (and corresponding allocation) is required.

> 
>> +
>> +$QEMU_IO -c "write -P 1 $((size)) 512" "$TEST_IMG.base" | _filter_qemu_io
>> +$QEMU_IO -c "write -z $((size + 1024)) 512" "$TEST_IMG" | _filter_qemu_io
>> +$QEMU_IO -c "read -P 1 $((size)) 512" "$TEST_IMG" | _filter_qemu_io
>> +$QEMU_IO -c "read -P 0 $((size + 512)) 1536" "$TEST_IMG" | _filter_qemu_io
>> +$QEMU_IMG map --output=json "$TEST_IMG" | _filter_qemu_img_map
>> +
>> +CLUSTER_SIZE=512 TEST_IMG="$TEST_IMG.base" _make_test_img $((size + 1024))
>> +_make_test_img -b "$TEST_IMG.base" $((size + 2048))
>> +
>> +# Writes at boundaries of (partial) cluster must not lose mid-cluster data
>> +# Backing file: 128m: ... | -- XX
>> +# Active layer: 128m: ... | 00 -- -- 00
>> +
>> +$QEMU_IO -c "write -P 1 $((size + 512)) 512" "$TEST_IMG.base" | _filter_qemu_io
>> +$QEMU_IO -c "write -z $((size)) 512" "$TEST_IMG" | _filter_qemu_io
>> +$QEMU_IO -c "read -P 0 $((size)) 512" "$TEST_IMG" | _filter_qemu_io
>> +$QEMU_IO -c "read -P 1 $((size + 512)) 512" "$TEST_IMG" | _filter_qemu_io
>> +$QEMU_IO -c "read -P 0 $((size + 1024)) 1024" "$TEST_IMG" | _filter_qemu_> +$QEMU_IO -c "write -z $((size + 1536)) 512" "$TEST_IMG" | _filter_qemu_io
> 
> [1]
> 
>> +$QEMU_IO -c "read -P 0 $((size)) 512" "$TEST_IMG" | _filter_qemu_io
>> +$QEMU_IO -c "read -P 1 $((size + 512)) 512" "$TEST_IMG" | _filter_qemu_io
>> +$QEMU_IO -c "read -P 0 $((size + 1024)) 1024" "$TEST_IMG" | _filter_qemu_io
>> +$QEMU_IMG map --output=json "$TEST_IMG" | _filter_qemu_img_map
>> +
>> +CLUSTER_SIZE=512 TEST_IMG="$TEST_IMG.base" _make_test_img $((size + 1024))
>> +_make_test_img -b "$TEST_IMG.base" $((size + 2048))
>> +
>> +# Partial write must not lose data
>> +# Backing file: 128m: ... | -- --
>> +# Active layer: 128m: ... | -- -- XX 00
> 
> Well, you have basically tested this in [1] already. After the first
> write -z to the active layer, the final four sectors are fully allocated
> (as they are in a single cluster) and look like this:
> 
> 00 01 00 00 = XX XX XX XX
> 
> (Because the 00s are just written as data.)
> 
> Now the write -z in [1] just overwrites the last of those four sectors
> with zero data (albeit it being zero already).
> 
> But if you want to test it again, I'm not stopping you. :-)
> 

Well, you've already proven that I need to revisit this test with a
fine-tooth comb, especially if I split the series to do the blkdebug
stuff separately from the qcow2 stuff.
Eric Blake May 4, 2017, 2:47 a.m. UTC | #3
On 04/28/2017 04:24 PM, Eric Blake wrote:

>>> +echo
>>> +echo == unaligned image tail cluster, no allocation needed ==
>>> +
>>> +CLUSTER_SIZE=1024 TEST_IMG="$TEST_IMG.base" _make_test_img $((size + 1024))
>>
>> Any reason for the CLUSTER_SIZE? It passes with 64 kB as well, and I
>> actually think that would be a better test because as it is, the image's
>> "unaligned" tail is exactly one cluster (so it isn't really unaligned).
> 
> Uhhh - copy-and-paste?  You're right, that 1024 is too small for what
> I'm actually doing with it.  :(

Actually, the whole test defaults to 4k clusters except when overridden,
but I did learn while hammering at things that we don't have a nice way
to tell that a backing file slightly shorter than the active file
behaves as though we read all zeros for the difference. I'm thinking of
proposing an RFC patch introducing BDRV_BLOCK_EOF, which gets set any
time bdrv_get_block_status() clamps the returns *pnum because it hit
EOF, to see what optimizations fall out elsewhere in the tree as a
result.  (It may not be worth it in the end, but we'll see).
diff mbox

Patch

diff --git a/block/qcow2.c b/block/qcow2.c
index 5c1573c..8038793 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -2449,6 +2449,10 @@  static bool is_zero_sectors(BlockDriverState *bs, int64_t start,
     BlockDriverState *file;
     int64_t res;

+    if (start + count > bs->total_sectors) {
+        count = bs->total_sectors - start;
+    }
+
     if (!count) {
         return true;
     }
@@ -2467,6 +2471,9 @@  static coroutine_fn int qcow2_co_pwrite_zeroes(BlockDriverState *bs,
     uint32_t tail = (offset + count) % s->cluster_size;

     trace_qcow2_pwrite_zeroes_start_req(qemu_coroutine_self(), offset, count);
+    if (offset + count == bs->total_sectors * BDRV_SECTOR_SIZE) {
+        tail = 0;
+    }

     if (head || tail) {
         int64_t cl_start = (offset - head) >> BDRV_SECTOR_BITS;
diff --git a/tests/qemu-iotests/154 b/tests/qemu-iotests/154
index 7ca7219..005f09f 100755
--- a/tests/qemu-iotests/154
+++ b/tests/qemu-iotests/154
@@ -2,7 +2,7 @@ 
 #
 # qcow2 specific bdrv_pwrite_zeroes tests with backing files (complements 034)
 #
-# Copyright (C) 2016 Red Hat, Inc.
+# Copyright (C) 2016-2017 Red Hat, Inc.
 #
 # This program is free software; you can redistribute it and/or modify
 # it under the terms of the GNU General Public License as published by
@@ -42,7 +42,10 @@  _supported_proto file
 _supported_os Linux

 CLUSTER_SIZE=4k
-size=128M
+size=$((128 * 1024 * 1024))
+
+# This test requires zero clusters, added in v3 images
+_unsupported_imgopts compat=0.10

 echo
 echo == backing file contains zeros ==
@@ -299,6 +302,111 @@  $QEMU_IO -c "read -P 0 75k 1k" "$TEST_IMG" | _filter_qemu_io

 $QEMU_IMG map --output=json "$TEST_IMG" | _filter_qemu_img_map

+echo
+echo == unaligned image tail cluster, no allocation needed ==
+
+CLUSTER_SIZE=1024 TEST_IMG="$TEST_IMG.base" _make_test_img $((size + 1024))
+_make_test_img -b "$TEST_IMG.base" $((size + 2048))
+
+# With no backing file, write to all or part of unallocated partial cluster
+
+# Backing file: 128m: ... | 00 --
+$QEMU_IO -c "write -z $size 512" "$TEST_IMG.base" | _filter_qemu_io
+
+# Backing file: 128m: ... | -- 00
+$QEMU_IO -c "write -z $size 512" "$TEST_IMG.base" | _filter_qemu_io
+
+# Backing file: 128m: ... | 00 00
+$QEMU_IO -c "write -z $size 1024" "$TEST_IMG.base" | _filter_qemu_io
+
+# No offset should be allocated, although the cluster itself is considered
+# allocated due to the zero flag
+$QEMU_IO -c "alloc $size 1024" "$TEST_IMG.base" | _filter_qemu_io
+$QEMU_IMG map --output=json "$TEST_IMG.base" | _filter_qemu_img_map
+
+# With backing file that reads as zero, write to all or part of entire cluster
+
+# Backing file: 128m: ... | 00 00
+# Active layer: 128m: ... | 00 00 00 00
+$QEMU_IO -c "write -z $size 2048" "$TEST_IMG" | _filter_qemu_io
+$QEMU_IMG map --output=json "$TEST_IMG" | _filter_qemu_img_map
+
+# Backing file: 128m: ... | 00 00
+# Active layer: 128m: ... | 00 00 -- --
+$QEMU_IO -c "write -z $((size)) 1024" "$TEST_IMG" | _filter_qemu_io
+
+# Backing file: 128m: ... | 00 00
+# Active layer: 128m: ... | -- -- 00 00
+$QEMU_IO -c "write -z $((size + 1024)) 1024" "$TEST_IMG" | _filter_qemu_io
+
+# Backing file: 128m: ... | 00 00
+# Active layer: 128m: ... | -- 00 00 --
+$QEMU_IO -c "write -z $((size + 512)) 1024" "$TEST_IMG" | _filter_qemu_io
+
+# No offset should be allocated, although the cluster itself is considered
+# allocated due to the zero flag
+$QEMU_IO -c "alloc $size 2048" "$TEST_IMG" | _filter_qemu_io
+$QEMU_IMG map --output=json "$TEST_IMG" | _filter_qemu_img_map
+
+# Preallocated cluster is not freed when unmap is not requested
+
+# Backing file: 128m: ... | XX -- => ... | XX 00
+$QEMU_IO -c "write $((size)) 512" "$TEST_IMG.base" | _filter_qemu_io
+$QEMU_IO -c "write -z $((size + 512)) 512" "$TEST_IMG.base" | _filter_qemu_io
+
+# Backing file: 128m: ... | XX 00 => ... | 00 00
+$QEMU_IO -c "write -z $((size)) 1024" "$TEST_IMG.base" | _filter_qemu_io
+$QEMU_IMG map --output=json "$TEST_IMG.base" | _filter_qemu_img_map |
+    sed 's/"offset": [0-9]*/"offset": OFFSET/g'
+
+echo
+echo == unaligned image tail cluster, allocation required ==
+
+CLUSTER_SIZE=512 TEST_IMG="$TEST_IMG.base" _make_test_img $((size + 1024))
+_make_test_img -b "$TEST_IMG.base" $((size + 2048))
+
+# Write beyond backing file must COW
+# Backing file: 128m: ... | XX --
+# Active layer: 128m: ... | -- -- 00 --
+
+$QEMU_IO -c "write -P 1 $((size)) 512" "$TEST_IMG.base" | _filter_qemu_io
+$QEMU_IO -c "write -z $((size + 1024)) 512" "$TEST_IMG" | _filter_qemu_io
+$QEMU_IO -c "read -P 1 $((size)) 512" "$TEST_IMG" | _filter_qemu_io
+$QEMU_IO -c "read -P 0 $((size + 512)) 1536" "$TEST_IMG" | _filter_qemu_io
+$QEMU_IMG map --output=json "$TEST_IMG" | _filter_qemu_img_map
+
+CLUSTER_SIZE=512 TEST_IMG="$TEST_IMG.base" _make_test_img $((size + 1024))
+_make_test_img -b "$TEST_IMG.base" $((size + 2048))
+
+# Writes at boundaries of (partial) cluster must not lose mid-cluster data
+# Backing file: 128m: ... | -- XX
+# Active layer: 128m: ... | 00 -- -- 00
+
+$QEMU_IO -c "write -P 1 $((size + 512)) 512" "$TEST_IMG.base" | _filter_qemu_io
+$QEMU_IO -c "write -z $((size)) 512" "$TEST_IMG" | _filter_qemu_io
+$QEMU_IO -c "read -P 0 $((size)) 512" "$TEST_IMG" | _filter_qemu_io
+$QEMU_IO -c "read -P 1 $((size + 512)) 512" "$TEST_IMG" | _filter_qemu_io
+$QEMU_IO -c "read -P 0 $((size + 1024)) 1024" "$TEST_IMG" | _filter_qemu_io
+$QEMU_IO -c "write -z $((size + 1536)) 512" "$TEST_IMG" | _filter_qemu_io
+$QEMU_IO -c "read -P 0 $((size)) 512" "$TEST_IMG" | _filter_qemu_io
+$QEMU_IO -c "read -P 1 $((size + 512)) 512" "$TEST_IMG" | _filter_qemu_io
+$QEMU_IO -c "read -P 0 $((size + 1024)) 1024" "$TEST_IMG" | _filter_qemu_io
+$QEMU_IMG map --output=json "$TEST_IMG" | _filter_qemu_img_map
+
+CLUSTER_SIZE=512 TEST_IMG="$TEST_IMG.base" _make_test_img $((size + 1024))
+_make_test_img -b "$TEST_IMG.base" $((size + 2048))
+
+# Partial write must not lose data
+# Backing file: 128m: ... | -- --
+# Active layer: 128m: ... | -- -- XX 00
+
+$QEMU_IO -c "write -P 1 $((size + 1024)) 512" "$TEST_IMG" | _filter_qemu_io
+$QEMU_IO -c "write -z $((size + 1536)) 512" "$TEST_IMG" | _filter_qemu_io
+$QEMU_IO -c "read -P 0 $((size)) 1024" "$TEST_IMG" | _filter_qemu_io
+$QEMU_IO -c "read -P 1 $((size + 1024)) 512" "$TEST_IMG" | _filter_qemu_io
+$QEMU_IO -c "read -P 0 $((size + 1536)) 512" "$TEST_IMG" | _filter_qemu_io
+$QEMU_IMG map --output=json "$TEST_IMG" | _filter_qemu_img_map
+
 # success, all done
 echo "*** done"
 rm -f $seq.full
diff --git a/tests/qemu-iotests/154.out b/tests/qemu-iotests/154.out
index da9eabd..f8a09af 100644
--- a/tests/qemu-iotests/154.out
+++ b/tests/qemu-iotests/154.out
@@ -282,4 +282,87 @@  read 1024/1024 bytes at offset 76800
 { "start": 69632, "length": 4096, "depth": 0, "zero": true, "data": false},
 { "start": 73728, "length": 4096, "depth": 0, "zero": false, "data": true, "offset": 24576},
 { "start": 77824, "length": 134139904, "depth": 1, "zero": true, "data": false}]
+
+== unaligned image tail cluster, no allocation needed ==
+Formatting 'TEST_DIR/t.IMGFMT.base', fmt=IMGFMT size=134218752
+Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=134219776 backing_file=TEST_DIR/t.IMGFMT.base
+wrote 512/512 bytes at offset 134217728
+512 bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+wrote 512/512 bytes at offset 134217728
+512 bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+wrote 1024/1024 bytes at offset 134217728
+1 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+1024/1024 bytes allocated at offset 128 MiB
+[{ "start": 0, "length": 134218752, "depth": 0, "zero": true, "data": false}]
+wrote 2048/2048 bytes at offset 134217728
+2 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+[{ "start": 0, "length": 134217728, "depth": 1, "zero": true, "data": false},
+{ "start": 134217728, "length": 2048, "depth": 0, "zero": true, "data": false}]
+wrote 1024/1024 bytes at offset 134217728
+1 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+wrote 1024/1024 bytes at offset 134218752
+1 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+wrote 1024/1024 bytes at offset 134218240
+1 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+2048/2048 bytes allocated at offset 128 MiB
+[{ "start": 0, "length": 134217728, "depth": 1, "zero": true, "data": false},
+{ "start": 134217728, "length": 2048, "depth": 0, "zero": true, "data": false}]
+wrote 512/512 bytes at offset 134217728
+512 bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+wrote 512/512 bytes at offset 134218240
+512 bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+wrote 1024/1024 bytes at offset 134217728
+1 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+[{ "start": 0, "length": 134217728, "depth": 0, "zero": true, "data": false},
+{ "start": 134217728, "length": 1024, "depth": 0, "zero": true, "data": false, "offset": OFFSET}]
+
+== unaligned image tail cluster, allocation required ==
+Formatting 'TEST_DIR/t.IMGFMT.base', fmt=IMGFMT size=134218752
+Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=134219776 backing_file=TEST_DIR/t.IMGFMT.base
+wrote 512/512 bytes at offset 134217728
+512 bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+wrote 512/512 bytes at offset 134218752
+512 bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+read 512/512 bytes at offset 134217728
+512 bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+read 1536/1536 bytes at offset 134218240
+1.500 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+[{ "start": 0, "length": 134217728, "depth": 1, "zero": true, "data": false},
+{ "start": 134217728, "length": 2048, "depth": 0, "zero": false, "data": true, "offset": 20480}]
+Formatting 'TEST_DIR/t.IMGFMT.base', fmt=IMGFMT size=134218752
+Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=134219776 backing_file=TEST_DIR/t.IMGFMT.base
+wrote 512/512 bytes at offset 134218240
+512 bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+wrote 512/512 bytes at offset 134217728
+512 bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+read 512/512 bytes at offset 134217728
+512 bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+read 512/512 bytes at offset 134218240
+512 bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+read 1024/1024 bytes at offset 134218752
+1 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+wrote 512/512 bytes at offset 134219264
+512 bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+read 512/512 bytes at offset 134217728
+512 bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+read 512/512 bytes at offset 134218240
+512 bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+read 1024/1024 bytes at offset 134218752
+1 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+[{ "start": 0, "length": 134217728, "depth": 1, "zero": true, "data": false},
+{ "start": 134217728, "length": 2048, "depth": 0, "zero": false, "data": true, "offset": 20480}]
+Formatting 'TEST_DIR/t.IMGFMT.base', fmt=IMGFMT size=134218752
+Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=134219776 backing_file=TEST_DIR/t.IMGFMT.base
+wrote 512/512 bytes at offset 134218752
+512 bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+wrote 512/512 bytes at offset 134219264
+512 bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+read 1024/1024 bytes at offset 134217728
+1 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+read 512/512 bytes at offset 134218752
+512 bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+read 512/512 bytes at offset 134219264
+512 bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+[{ "start": 0, "length": 134217728, "depth": 1, "zero": true, "data": false},
+{ "start": 134217728, "length": 2048, "depth": 0, "zero": false, "data": true, "offset": 20480}]
 *** done