[v4,24/30] qcow2: Clear the L2 bitmap when allocating a compressed cluster

Message ID 6d596d82ed62615a8565b661691a06bfaf32237e.1584468723.git.berto@igalia.com (mailing list archive)
State New, archived
Series Add subcluster allocation to qcow2

Commit Message

Alberto Garcia March 17, 2020, 6:16 p.m. UTC
Compressed clusters always have the bitmap part of the extended L2
entry set to 0.

Signed-off-by: Alberto Garcia <berto@igalia.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
---
 block/qcow2-cluster.c | 3 +++
 1 file changed, 3 insertions(+)

Comments

Alberto Garcia April 24, 2020, 5:02 p.m. UTC | #1
On Tue 17 Mar 2020 07:16:21 PM CET, Alberto Garcia <berto@igalia.com> wrote:
> Compressed clusters always have the bitmap part of the extended L2
> entry set to 0.

I was just finishing some improvements to the new code that allows
BDRV_REQ_ZERO_WRITE at the subcluster level, and I'm starting to
entertain the idea of using the L2 bitmap for compressed clusters as
well.

I will run some tests next week, but I would like to know your opinion
in case I'm missing something.

A compressed cluster cannot be divided into subclusters on the image:
you would not be able to allocate or overwrite them separately,
therefore any write request necessarily has to write (or do COW of) the
whole cluster.

However if you consider the uncompressed guest data I don't see any
reason why you wouldn't be able to zeroize or even deallocate individual
subclusters. These operations don't touch the cluster data on disk
anyway, they only touch the L2 metadata in order to change what the
guest sees.
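
Concretely, a sketch of what zeroizing subclusters of a compressed
cluster could look like, reusing this series' bitmap helpers (the
application to compressed clusters is hypothetical, and sc_from/sc_to
as well as QCOW_OFLAG_SUB_ZERO_RANGE() are illustrative assumptions):

    /* Mark subclusters [sc_from, sc_to) of a compressed cluster as
     * reading as zeroes by flipping bits in the L2 bitmap only; the
     * compressed data on disk is not touched. */
    uint64_t bitmap = get_l2_bitmap(s, l2_slice, l2_index);
    bitmap |= QCOW_OFLAG_SUB_ZERO_RANGE(sc_from, sc_to);
    set_l2_bitmap(s, l2_slice, l2_index, bitmap);
    qcow2_cache_entry_mark_dirty(s->l2_table_cache, l2_slice);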

'write -c 0 64k' followed by 'write -z 16k 16k' would not need to do any
copy on write. The compressed data would remain untouched on disk but
some of the subclusters would have the 'all zeroes' bit set, exactly
like what happens with normal clusters.

I think that this would make the on-disk format a bit simpler in general
(no need to treat compressed clusters differently in some cases) and it
would add a new optimization to compressed images. I just need to make
sure that it doesn't complicate the code (my feeling is that it would
actually simplify it, but I have to see).

Opinions?

Berto
Eric Blake April 24, 2020, 5:11 p.m. UTC | #2
On 4/24/20 12:02 PM, Alberto Garcia wrote:
> On Tue 17 Mar 2020 07:16:21 PM CET, Alberto Garcia <berto@igalia.com> wrote:
>> Compressed clusters always have the bitmap part of the extended L2
>> entry set to 0.
> 
> I was just finishing some improvements to the new code that allows
> BDRV_REQ_ZERO_WRITE at the subcluster level, and I'm starting to
> entertain the idea of using the L2 bitmap for compressed clusters as
> well.
> 
> I will run some tests next week, but I would like to know your opinion
> in case I'm missing something.
> 
> A compressed cluster cannot be divided into subclusters on the image:
> you would not be able to allocate or overwrite them separately,
> therefore any write request necessarily has to write (or do COW of) the
> whole cluster.
> 
> However if you consider the uncompressed guest data I don't see any
> reason why you wouldn't be able to zeroize or even deallocate individual
> subclusters. These operations don't touch the cluster data on disk
> anyway, they only touch the L2 metadata in order to change what the
> guest sees.
> 
> 'write -c 0 64k' followed by 'write -z 16k 16k' would not need to do any
> copy on write. The compressed data would remain untouched on disk but
> some of the subclusters would have the 'all zeroes' bit set, exactly
> like what happens with normal clusters.

It's a special case that avoids COW for write zeroes, but not for 
anything else. The moment you write any data (whether to the 
zero-above-compressed or the regular compressed portion), the entire 
cluster has to be rewritten.  I'm not sure how frequently guests will 
actually have the scenario of doing a zero request on a sub-cluster, but 
at the same time, I can see where you're coming from in stating that if 
it makes management of extended L2 easier to allow zero subclusters on 
top of a compressed cluster, then there's no reason to forbid it.

> 
> I think that this would make the on-disk format a bit simpler in general
> (no need to treat compressed clusters differently in some cases) and it
> would add a new optimization to compressed images. I just need to make
> sure that it doesn't complicate the code (my feeling is that it would
> actually simplify it, but I have to see).
> 
> Opinions?
> 
> Berto
>
Alberto Garcia April 24, 2020, 5:21 p.m. UTC | #3
On Fri 24 Apr 2020 07:11:08 PM CEST, Eric Blake <eblake@redhat.com> wrote:
>> 'write -c 0 64k' followed by 'write -z 16k 16k' would not need to do any
>> copy on write. The compressed data would remain untouched on disk but
>> some of the subclusters would have the 'all zeroes' bit set, exactly
>> like what happens with normal clusters.
>
> It's a special case that avoids COW for write zeroes, but not for
> anything else. The moment you write any data (whether to the
> zero-above-compressed or the regular compressed portion), the entire
> cluster has to be rewritten.

That's right but you can still write zeroes without having to rewrite
anything, and read back the zeroes without having to decompress the
data.

> at the same time, I can see where you're coming from in stating that
> if it makes management of extended L2 easier to allow zero subclusters
> on top of a compressed cluster, then there's no reason to forbid it.

I'm not sure if it makes it easier. Some operations are definitely going
to be easier but maybe we have to add and handle _ZERO_COMPRESSED in
addition to _ZERO_PLAIN and _ZERO_ALLOC (the same for unallocated
subclusters). Or maybe replace QCow2SubclusterType with something
else. I need to evaluate that.
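
For reference, a rough sketch of the enum in question with the
variants I mention added (the existing names are the ones from this
series; the two _COMPRESSED entries are purely hypothetical):

    typedef enum QCow2SubclusterType {
        QCOW2_SUBCLUSTER_UNALLOCATED_PLAIN,
        QCOW2_SUBCLUSTER_UNALLOCATED_ALLOC,
        QCOW2_SUBCLUSTER_ZERO_PLAIN,
        QCOW2_SUBCLUSTER_ZERO_ALLOC,
        QCOW2_SUBCLUSTER_NORMAL,
        QCOW2_SUBCLUSTER_COMPRESSED,
        /* hypothetical additions under discussion: */
        QCOW2_SUBCLUSTER_ZERO_COMPRESSED,
        QCOW2_SUBCLUSTER_UNALLOCATED_COMPRESSED,
    } QCow2SubclusterType;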

Berto
Eric Blake April 24, 2020, 5:44 p.m. UTC | #4
On 4/24/20 12:21 PM, Alberto Garcia wrote:
> On Fri 24 Apr 2020 07:11:08 PM CEST, Eric Blake <eblake@redhat.com> wrote:
>>> 'write -c 0 64k' followed by 'write -z 16k 16k' would not need to do any
>>> copy on write. The compressed data would remain untouched on disk but
>>> some of the subclusters would have the 'all zeroes' bit set, exactly
>>> like what happens with normal clusters.
>>
>> It's a special case that avoids COW for write zeroes, but not for
>> anything else. The moment you write any data (whether to the
>> zero-above-compressed or the regular compressed portion), the entire
>> cluster has to be rewritten.
> 
> That's right but you can still write zeroes without having to rewrite
> anything, and read back the zeroes without having to decompress the
> data.
> 
>> at the same time, I can see where you're coming from in stating that
>> if it makes management of extended L2 easier to allow zero subclusters
>> on top of a compressed cluster, then there's no reason to forbid it.
> 
> I'm not sure if it makes it easier. Some operations are definitely going
> to be easier but maybe we have to add and handle _ZERO_COMPRESSED in
> addition to _ZERO_PLAIN and _ZERO_ALLOC (the same for unallocated
> subclusters). Or maybe replace QCow2SubclusterType with something
> else. I need to evaluate that.

Reading the entire cluster will be interesting - you'll have to 
decompress the entire memory, then overwrite the zeroed portions.  The 
savings in reading occur only when your read is limited to just the 
subclusters that are zeroed.

But then again, even on a regular cluster, read has to pay attention to 
which subclusters are zeroed, so you already have the workhorse in read 
for detecting whether a normal read is sufficient or if you have to 
follow up with piecing together zeroed sections.
Alberto Garcia April 24, 2020, 5:56 p.m. UTC | #5
On Fri 24 Apr 2020 07:44:33 PM CEST, Eric Blake <eblake@redhat.com> wrote:
>>> at the same time, I can see where you're coming from in stating that
>>> if it makes management of extended L2 easier to allow zero subclusters
>>> on top of a compressed cluster, then there's no reason to forbid it.
>> 
>> I'm not sure if it makes it easier. Some operations are definitely going
>> to be easier but maybe we have to add and handle _ZERO_COMPRESSED in
>> addition to _ZERO_PLAIN and _ZERO_ALLOC (the same for unallocated
>> subclusters). Or maybe replace QCow2SubclusterType with something
>> else. I need to evaluate that.
>
> Reading the entire cluster will be interesting - you'll have to
> decompress the entire memory, then overwrite the zeroed portions.

I don't think so, qcow2_get_host_offset() would detect the number of
contiguous subclusters of the same type at the given offset. In this
case they would be _ZERO subclusters so there's no need to decompress
anything, or even read it (it works the same with uncompressed
clusters).

Berto
Vladimir Sementsov-Ogievskiy April 24, 2020, 6:15 p.m. UTC | #6
24.04.2020 20:44, Eric Blake wrote:
> On 4/24/20 12:21 PM, Alberto Garcia wrote:
>> On Fri 24 Apr 2020 07:11:08 PM CEST, Eric Blake <eblake@redhat.com> wrote:
>>>> 'write -c 0 64k' followed by 'write -z 16k 16k' would not need to do any
>>>> copy on write. The compressed data would remain untouched on disk but
>>>> some of the subclusters would have the 'all zeroes' bit set, exactly
>>>> like what happens with normal clusters.
>>>
>>> It's a special case that avoids COW for write zeroes, but not for
>>> anything else. The moment you write any data (whether to the
>>> zero-above-compressed or the regular compressed portion), the entire
>>> cluster has to be rewritten.
>>
>> That's right but you can still write zeroes without having to rewrite
>> anything, and read back the zeroes without having to decompress the
>> data.
>>
>>> at the same time, I can see where you're coming from in stating that
>>> if it makes management of extended L2 easier to allow zero subclusters
>>> on top of a compressed cluster, then there's no reason to forbid it.
>>
>> I'm not sure if it makes it easier. Some operations are definitely going
>> to be easier but maybe we have to add and handle _ZERO_COMPRESSED in
>> addition to _ZERO_PLAIN and _ZERO_ALLOC (the same for unallocated
>> subclusters). Or maybe replace QCow2SubclusterType with something
>> else. I need to evaluate that.
> 
> Reading the entire cluster will be interesting - you'll have to decompress the entire memory, then overwrite the zeroed portions.  The savings in reading occur only when your read is limited to just the subclusters that are zeroed.
> 
> But then again, even on a regular cluster, read has to pay attention to which subclusters are zeroed, so you already have the workhorse in read for detecting whether a normal read is sufficient or if you have to follow up with piecing together zeroed sections.
> 

AFAIK, compressed clusters currently can't be used in scenarios with a guest, as the qcow2 driver doesn't support rewriting them. Or am I wrong? And we normally don't combine normal and compressed clusters together in one image. So, currently, the usual use case for compressed clusters is a fully compressed image, written once.

This means that, with the current specification, subclusters add nothing to this case, and there is no reason to create a compressed image with subclusters. And even if we allow zero/unallocated subclusters, it seems to add nothing to this use case.

So I don't see any real benefit for now, but no problems with it either, so I agree that it's mostly about which way is simpler.
Vladimir Sementsov-Ogievskiy April 24, 2020, 6:25 p.m. UTC | #7
24.04.2020 20:56, Alberto Garcia wrote:
> On Fri 24 Apr 2020 07:44:33 PM CEST, Eric Blake <eblake@redhat.com> wrote:
>>>> at the same time, I can see where you're coming from in stating that
>>>> if it makes management of extended L2 easier to allow zero subclusters
>>>> on top of a compressed cluster, then there's no reason to forbid it.
>>>
>>> I'm not sure if it makes it easier. Some operations are definitely going
>>> to be easier but maybe we have to add and handle _ZERO_COMPRESSED in
>>> addition to _ZERO_PLAIN and _ZERO_ALLOC (the same for unallocated
>>> subclusters). Or maybe replace QCow2SubclusterType with something
>>> else. I need to evaluate that.

While reviewing your series it already crossed my mind that we are doing too much with the conversion from L2 entry flags to a "type". Is it worth it? All these ZERO_PLAIN and UNALLOCATED_ALLOC values, and the "case <TYPE>:" lines combined three or four into one case: do they help, or are they extra work? We just have to maintain two views of one model. But I'm not suggesting refactoring that in this series :)

>>
>> Reading the entire cluster will be interesting - you'll have to
>> decompress the entire memory, then overwrite the zeroed portions.
> 
> I don't think so, qcow2_get_host_offset() would detect the number of
> contiguous subclusters of the same type at the given offset. In this
> case they would be _ZERO subclusters so there's no need to decompress
> anything, or even read it (it works the same with uncompressed
> clusters).
> 

But if at least one of the subclusters to read is not _ZERO, you'll have to decompress the whole cluster and, after decompression, overwrite the zero subclusters with zeroes, as Eric says. Or have I lost the thread? :)
Alberto Garcia April 24, 2020, 6:37 p.m. UTC | #8
On Fri 24 Apr 2020 08:25:45 PM CEST, Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com> wrote:
>>> Reading the entire cluster will be interesting - you'll have to
>>> decompress the entire memory, then overwrite the zeroed portions.
>> 
>> I don't think so, qcow2_get_host_offset() would detect the number of
>> contiguous subclusters of the same type at the given offset. In this
>> case they would be _ZERO subclusters so there's no need to decompress
>> anything, or even read it (it works the same with uncompressed
>> clusters).
>
> But if at least one of the subclusters to read is not _ZERO, you'll
> have to decompress the whole cluster and, after decompression,
> overwrite the zero subclusters with zeroes, as Eric says. Or have I
> lost the thread? :)

I don't see why you would need to rewrite anything... you do have to
decompress the whole cluster, and the uncompressed cluster in memory
would have stale data, but you never need to use that data for anything,
let alone to return it to the guest.

Even if there's a COW, the new cluster would inherit the compressed
cluster's bitmap so the zeroized subclusters still read as zeroes.

It's the same with normal clusters, 'write -P 0xff 0 64k' followed by
'write -z 16k 16k'. The host cluster on disk still reads as 0xff but the
L2 entry indicates that part of it is just zeroes.

Berto
Alberto Garcia April 24, 2020, 6:41 p.m. UTC | #9
On Fri 24 Apr 2020 08:15:04 PM CEST, Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com> wrote:
> AFAIK, compressed clusters currently can't be used in scenarios with
> a guest, as the qcow2 driver doesn't support rewriting them.

You can write to those images just fine, it's just not efficient because
you have to COW the compressed clusters.

> Or am I wrong? And we normally don't combine normal and compressed
> clusters together in one image.

As soon as you start writing to an image with compressed clusters you'll
have a combination of both.

But it's true that you don't have an image with compressed clusters if
what you're looking for is performance. So I wouldn't add support for
this if it complicates things too much.

Berto
Eric Blake April 24, 2020, 6:47 p.m. UTC | #10
On 4/24/20 1:37 PM, Alberto Garcia wrote:
> On Fri 24 Apr 2020 08:25:45 PM CEST, Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com> wrote:
>>>> Reading the entire cluster will be interesting - you'll have to
>>>> decompress the entire memory, then overwrite the zeroed portions.
>>>
>>> I don't think so, qcow2_get_host_offset() would detect the number of
>>> contiguous subclusters of the same type at the given offset. In this
>>> case they would be _ZERO subclusters so there's no need to decompress
>>> anything, or even read it (it works the same with uncompressed
>>> clusters).
>>
>> But if at least one of the subclusters to read is not _ZERO, you'll
>> have to decompress the whole cluster and, after decompression,
>> overwrite the zero subclusters with zeroes, as Eric says. Or have I
>> lost the thread? :)
> 
> I don't see why you would need to rewrite anything... you do have to
> decompress the whole cluster, and the uncompressed cluster in memory
> would have stale data, but you never need to use that data for anything,
> let alone to return it to the guest.
> 
> Even if there's a COW, the new cluster would inherit the compressed
> cluster's bitmap so the zeroized subclusters still read as zeroes.
> 
> It's the same with normal clusters, 'write -P 0xff 0 64k' followed by
> 'write -z 16k 16k'. The host cluster on disk still reads as 0xff but the
> L2 entry indicates that part of it is just zeroes.

The point is this:  Consider 'write -P 0xff 0 64k', then 'write -z 16k 
16k', then 'read 0 64k'.  For normal clusters, we can just do a 
scatter-gather iov read of read 0-16k and 32-64k, plus a memset of 
16-32k.  But for compressed clusters, we have to read and decompress the 
entire 64k, AND also memset 16k-32k.  But if zeroing after reading is 
not that expensive, then the same technique for normal clusters is fine 
(instead of a scatter-gather read of 48k, just read the whole 64k 
cluster before doing the memset).  So the question at hand is not what 
happens in writing, but in reading, and whether we are penalizing reads 
from a compressed cluster or even from regular clusters, when reading 
from a cluster where subclusters have different status.
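
To make that concrete, here is a rough sketch of the read-side split
for regular clusters (helper names follow this series, but the exact
v4 signatures may differ, and the function itself is illustrative):

    /* Illustrative only: split a read into runs of same-status
     * subclusters; zero runs are served by memset() with no disk I/O. */
    static int coroutine_fn
    sketch_preadv(BlockDriverState *bs, uint64_t offset,
                  unsigned int bytes, QEMUIOVector *qiov,
                  size_t qiov_offset)
    {
        while (bytes > 0) {
            unsigned int cur_bytes = bytes;
            uint64_t host_offset;
            QCow2SubclusterType type;
            int ret = qcow2_get_host_offset(bs, offset, &cur_bytes,
                                            &host_offset, &type);
            if (ret < 0) {
                return ret;
            }
            if (type == QCOW2_SUBCLUSTER_ZERO_PLAIN ||
                type == QCOW2_SUBCLUSTER_ZERO_ALLOC) {
                /* e.g. the zeroed 16k-32k range: memset, no read */
                qemu_iovec_memset(qiov, qiov_offset, 0, cur_bytes);
            } else {
                /* data or compressed run: normal read path (omitted) */
            }
            offset += cur_bytes;
            qiov_offset += cur_bytes;
            bytes -= cur_bytes;
        }
        return 0;
    }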
Vladimir Sementsov-Ogievskiy April 25, 2020, 6:38 a.m. UTC | #11
24.04.2020 21:41, Alberto Garcia wrote:
> On Fri 24 Apr 2020 08:15:04 PM CEST, Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com> wrote:
>> AFAIK, compressed clusters currently can't be used in scenarios with
>> a guest, as the qcow2 driver doesn't support rewriting them.
> 
> You can write to those images just fine, it's just not efficient because
> you have to COW the compressed clusters.

No, rewriting doesn't work:

[root@kvm master]# ./qemu-img create -f qcow2 x 10M
Formatting 'x', fmt=qcow2 size=10485760 cluster_size=65536 lazy_refcounts=off refcount_bits=16
[root@kvm master]# ./qemu-io -c 'write -c 0 64K' x
wrote 65536/65536 bytes at offset 0
64 KiB, 1 ops; 00.23 sec (278.708 KiB/sec and 4.3548 ops/sec)
[root@kvm master]# ./qemu-io -c 'write -c 0 64K' x
write failed: Input/output error


> 
>> Or am I wrong? And we normally don't combine normal and compressed
>> clusters together in one image.
> 
> As soon as you start writing to an image with compressed clusters you'll
> have a combination of both.

Ah, you mean rewriting compressed clusters with normal ones. So the use case is just to take a compressed backup image and use it for the guest, instead of converting it first. That makes sense.

> 
> But it's true that you don't have an image with compressed clusters if
> what you're looking for is performance. So I wouldn't add support for
> this if it complicates things too much.
> 
> Berto
>
Max Reitz April 27, 2020, 7:49 a.m. UTC | #12
On 24.04.20 20:47, Eric Blake wrote:
> On 4/24/20 1:37 PM, Alberto Garcia wrote:
>> On Fri 24 Apr 2020 08:25:45 PM CEST, Vladimir Sementsov-Ogievskiy
>> <vsementsov@virtuozzo.com> wrote:
>>>>> Reading the entire cluster will be interesting - you'll have to
>>>>> decompress the entire memory, then overwrite the zeroed portions.
>>>>
>>>> I don't think so, qcow2_get_host_offset() would detect the number of
>>>> contiguous subclusters of the same type at the given offset. In this
>>>> case they would be _ZERO subclusters so there's no need to decompress
>>>> anything, or even read it (it works the same with uncompressed
>>>> clusters).
>>>
>>> But if at least one of the subclusters to read is not _ZERO, you'll
>>> have to decompress the whole cluster and, after decompression,
>>> overwrite the zero subclusters with zeroes, as Eric says. Or have I
>>> lost the thread? :)
>>
>> I don't see why you would need to rewrite anything... you do have to
>> decompress the whole cluster, and the uncompressed cluster in memory
>> would have stale data, but you never need to use that data for anything,
>> let alone to return it to the guest.
>>
>> Even if there's a COW, the new cluster would inherit the compressed
>> cluster's bitmap so the zeroized subclusters still read as zeroes.
>>
>> It's the same with normal clusters, 'write -P 0xff 0 64k' followed by
>> 'write -z 16k 16k'. The host cluster on disk still reads as 0xff but the
>> L2 entry indicates that part of it is just zeroes.
> 
> The point is this:  Consider 'write -P 0xff 0 64k', then 'write -z 16k
> 16k', then 'read 0 64k'.  For normal clusters, we can just do a
> scatter-gather iov read of read 0-16k and 32-64k, plus a memset of
> 16-32k.  But for compressed clusters, we have to read and decompress the
> entire 64k, AND also memset 16k-32k.  But if zeroing after reading is
> not that expensive, then the same technique for normal clusters is fine
> (instead of a scatter-gather read of 48k, just read the whole 64k
> cluster before doing the memset).

It would also mean letting qcow2_co_preadv_part() special-handle such
cases, i.e., whenever the whole cluster is compressed, it needs to read
it as a whole, regardless of the subcluster status, and then memset()
all areas to zero that are all-zero subclusters.  Otherwise we’d read
and decompress the whole buffer twice (once for 0 to 16k, once for 32k
to 64k).

This may be complicated a bit by the task schema, i.e. that reads are
scheduled in the task pool.  For qcow2_co_preadv_part() to memset some
area after decompression, it would need to wait on the read_compressed
task, which would make the whole task pool thing moot (for compressed
clusters).  Or it just does the memset() at the end, when we have to
settle the task pool anyway, but then it would have to remember all
areas it still needs to zero.

Hm, or, qcow2_co_preadv_compressed() could figure out where the zeroed
subclusters are and then memset() them itself, e.g. by receiving the
subcluster bitmap.  Probably the simplest implementation, but it seems a
bit like a layering breach.
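
A minimal sketch of that bitmap-driven memset() (hypothetical: this
series does not use the bitmap for compressed clusters, and the field
and macro names are assumptions based on the series):

    /* After decompressing the whole cluster into buf, clear every
     * subcluster whose "reads as zero" bit is set in the L2 bitmap. */
    for (unsigned sc = 0; sc < s->subclusters_per_cluster; sc++) {
        if (l2_bitmap & QCOW_OFLAG_SUB_ZERO(sc)) {
            memset(buf + sc * s->subcluster_size, 0, s->subcluster_size);
        }
    }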

Not sure how bad the complexity is on the write side for not letting
zero writes just zero the subcluster, but it doesn’t seem to me that the
opposite would come for free on the read side.

Max
Alberto Garcia April 27, 2020, 6:12 p.m. UTC | #13
On Mon 27 Apr 2020 09:49:00 AM CEST, Max Reitz wrote:
>> The point is this: Consider 'write -P 0xff 0 64k', then 'write -z 16k
>> 16k', then 'read 0 64k'. For normal clusters, we can just do a
>> scatter-gather iov read of read 0-16k and 32-64k, plus a memset of
>> 16-32k. But for compressed clusters, we have to read and decompress
>> the entire 64k, AND also memset 16k-32k. But if zeroing after reading
>> is not that expensive, then the same technique for normal clusters is
>> fine (instead of a scatter-gather read of 48k, just read the whole
>> 64k cluster before doing the memset).
>
> It would also mean letting qcow2_co_preadv_part() special-handle such
> cases, i.e., whenever the whole cluster is compressed, it needs to
> read it as a whole, regardless of the subcluster status, and then
> memset() all areas to zero that are all-zero subclusters.  Otherwise
> we’d read and decompress the whole buffer twice (once for 0 to 16k,
> once for 32k to 64k).

This is actually a good reason against adding subcluster allocation to
compressed clusters.

I wouldn't like to complicate the code for this use case, so we either
don't support it at all, or we support it with the problem that you
mention (decompressing the whole buffer more than once if the cluster
contains holes).

> Not sure how bad the complexity is on the write side for not letting
> zero writes just zero the subcluster

It is not bad, I just have to check the cluster type and return
-ENOTSUP.
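
Something along these lines (a sketch; the exact placement in the
zero-write path and the variables around it are assumptions):

    /* Compressed clusters keep an all-zero bitmap in this series, so
     * refuse subcluster zeroing and let the caller fall back to the
     * generic COW-based path. */
    if (qcow2_get_cluster_type(bs, l2_entry) == QCOW2_CLUSTER_COMPRESSED) {
        return -ENOTSUP;
    }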

Berto

Patch

diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c
index dfd8b66958..1f471db98c 100644
--- a/block/qcow2-cluster.c
+++ b/block/qcow2-cluster.c
@@ -806,6 +806,9 @@ int qcow2_alloc_compressed_cluster_offset(BlockDriverState *bs,
     BLKDBG_EVENT(bs->file, BLKDBG_L2_UPDATE_COMPRESSED);
     qcow2_cache_entry_mark_dirty(s->l2_table_cache, l2_slice);
     set_l2_entry(s, l2_slice, l2_index, cluster_offset);
+    if (has_subclusters(s)) {
+        set_l2_bitmap(s, l2_slice, l2_index, 0);
+    }
     qcow2_cache_put(s->l2_table_cache, (void **) &l2_slice);
 
     *host_offset = cluster_offset & s->cluster_offset_mask;