diff mbox series

[v4,07/30] qcow2: Document the Extended L2 Entries feature

Message ID aa1ac3fbb1d42db67d930b76255c9b7b19c07b42.1584468723.git.berto@igalia.com (mailing list archive)
State New, archived
Series Add subcluster allocation to qcow2

Commit Message

Alberto Garcia March 17, 2020, 6:16 p.m. UTC
Subcluster allocation in qcow2 is implemented by extending the
existing L2 table entries and adding additional information to
indicate the allocation status of each subcluster.

This patch documents the changes to the qcow2 format and how they
affect the calculation of the L2 cache size.

Signed-off-by: Alberto Garcia <berto@igalia.com>
---
 docs/interop/qcow2.txt | 68 ++++++++++++++++++++++++++++++++++++++++--
 docs/qcow2-cache.txt   | 19 +++++++++++-
 2 files changed, 83 insertions(+), 4 deletions(-)

Comments

Max Reitz April 8, 2020, 11:09 a.m. UTC | #1
On 17.03.20 19:16, Alberto Garcia wrote:
> Subcluster allocation in qcow2 is implemented by extending the
> existing L2 table entries and adding additional information to
> indicate the allocation status of each subcluster.
> 
> This patch documents the changes to the qcow2 format and how they
> affect the calculation of the L2 cache size.
> 
> Signed-off-by: Alberto Garcia <berto@igalia.com>
> ---
>  docs/interop/qcow2.txt | 68 ++++++++++++++++++++++++++++++++++++++++--
>  docs/qcow2-cache.txt   | 19 +++++++++++-
>  2 files changed, 83 insertions(+), 4 deletions(-)

Reviewed-by: Max Reitz <mreitz@redhat.com>
Eric Blake April 9, 2020, 3:12 p.m. UTC | #2
On 3/17/20 1:16 PM, Alberto Garcia wrote:
> Subcluster allocation in qcow2 is implemented by extending the
> existing L2 table entries and adding additional information to
> indicate the allocation status of each subcluster.
> 
> This patch documents the changes to the qcow2 format and how they
> affect the calculation of the L2 cache size.
> 
> Signed-off-by: Alberto Garcia <berto@igalia.com>
> ---
>   docs/interop/qcow2.txt | 68 ++++++++++++++++++++++++++++++++++++++++--
>   docs/qcow2-cache.txt   | 19 +++++++++++-
>   2 files changed, 83 insertions(+), 4 deletions(-)
> 

> +== Extended L2 Entries ==
> +
> +An image uses Extended L2 Entries if bit 4 is set on the incompatible_features
> +field of the header.
> +
> +In these images standard data clusters are divided into 32 subclusters of the
> +same size. They are contiguous and start from the beginning of the cluster.
> +Subclusters can be allocated independently and the L2 entry contains information
> +indicating the status of each one of them. Compressed data clusters don't have
> +subclusters so they are treated the same as in images without this feature.
> +
> +The size of an extended L2 entry is 128 bits so the number of entries per table
> +is calculated using this formula:
> +
> +    l2_entries = (cluster_size / (2 * sizeof(uint64_t)))
> +
> +The first 64 bits have the same format as the standard L2 table entry described
> +in the previous section, with the exception of bit 0 of the standard cluster
> +descriptor.
> +
> +The last 64 bits contain a subcluster allocation bitmap with this format:
> +
> +Subcluster Allocation Bitmap (for standard clusters):
> +
> +    Bit  0 -  31:   Allocation status (one bit per subcluster)
> +
> +                    1: the subcluster is allocated. In this case the
> +                       host cluster offset field must contain a valid
> +                       offset.
> +                    0: the subcluster is not allocated. In this case
> +                       read requests shall go to the backing file or
> +                       return zeros if there is no backing file data.

Hmm - raw external files are incompatible with backing files.  Should we 
also document that extended L2 entries are incompatible with raw 
external files?  (The text here reminded me about it, but it would be 
the text earlier at the incompatible feature bits that we edit if we 
want that additional restriction; compare to the restriction in the 
autoclear bit 1).  After all, when raw external file is enabled, the 
entire image is allocated, at which point subclusters don't make much sense.

And in stating that, it looks like we have a pre-existing hole in that 
header bytes 8-15 don't mention the incompatibility with autoclear (when 
things are incompatible, it's best to mention the restriction from both 
sides, rather than only one of the sides, to make sure the reader 
notices the restriction regardless of which field they look up first). 
But tweaking that would be a separate patch.
Vladimir Sementsov-Ogievskiy April 10, 2020, 9:29 a.m. UTC | #3
09.04.2020 18:12, Eric Blake wrote:
> On 3/17/20 1:16 PM, Alberto Garcia wrote:
>> Subcluster allocation in qcow2 is implemented by extending the
>> existing L2 table entries and adding additional information to
>> indicate the allocation status of each subcluster.
>>
>> This patch documents the changes to the qcow2 format and how they
>> affect the calculation of the L2 cache size.
>>
>> Signed-off-by: Alberto Garcia <berto@igalia.com>
>> ---
>>   docs/interop/qcow2.txt | 68 ++++++++++++++++++++++++++++++++++++++++--
>>   docs/qcow2-cache.txt   | 19 +++++++++++-
>>   2 files changed, 83 insertions(+), 4 deletions(-)
>>
> 
>> +== Extended L2 Entries ==
>> +
>> +An image uses Extended L2 Entries if bit 4 is set on the incompatible_features
>> +field of the header.
>> +
>> +In these images standard data clusters are divided into 32 subclusters of the
>> +same size. They are contiguous and start from the beginning of the cluster.
>> +Subclusters can be allocated independently and the L2 entry contains information
>> +indicating the status of each one of them. Compressed data clusters don't have
>> +subclusters so they are treated the same as in images without this feature.
>> +
>> +The size of an extended L2 entry is 128 bits so the number of entries per table
>> +is calculated using this formula:
>> +
>> +    l2_entries = (cluster_size / (2 * sizeof(uint64_t)))
>> +
>> +The first 64 bits have the same format as the standard L2 table entry described
>> +in the previous section, with the exception of bit 0 of the standard cluster
>> +descriptor.
>> +
>> +The last 64 bits contain a subcluster allocation bitmap with this format:
>> +
>> +Subcluster Allocation Bitmap (for standard clusters):
>> +
>> +    Bit  0 -  31:   Allocation status (one bit per subcluster)
>> +
>> +                    1: the subcluster is allocated. In this case the
>> +                       host cluster offset field must contain a valid
>> +                       offset.
>> +                    0: the subcluster is not allocated. In this case
>> +                       read requests shall go to the backing file or
>> +                       return zeros if there is no backing file data.
> 
> Hmm - raw external files are incompatible with backing files.  Should we also document that extended L2 entries are incompatible with raw external files?  (The text here reminded me about it, but it would be the text earlier at the incompatible feature bits that we edit if we want that additional restriction; compare to the restriction in the autoclear bit 1).  After all, when raw external file is enabled, the entire image is allocated, at which point subclusters don't make much sense.

It still may cache information about zeroed subclusters: gives more detailed block-status. But we should mention somehow external files. Hm. not only for raw external files, but it is documented that cluster can't be unallocated when an external data file is used.

> 
> And in stating that, it looks like we have a pre-existing hole in that header bytes 8-15 don't mention the incompatibility with autoclear (when things are incompatible, it's best to mention the restriction from both sides, rather than only one of the sides, to make sure the reader notices the restriction regardless of which field they look up first). But tweaking that would be a separate patch.
>
Alberto Garcia April 10, 2020, 12:01 p.m. UTC | #4
On Thu 09 Apr 2020 05:12:16 PM CEST, Eric Blake wrote:
> Hmm - raw external files are incompatible with backing files.  Should
> we also document that extended L2 entries are incompatible with raw
> external files?

Ok, I can also add additional checks to forbid creating such images.

Berto
Alberto Garcia April 14, 2020, 2:50 p.m. UTC | #5
On Fri 10 Apr 2020 11:29:59 AM CEST, Vladimir Sementsov-Ogievskiy wrote:
>> Hmm - raw external files are incompatible with backing files. Should
>> we also document that extended L2 entries are incompatible with raw
>> external files? (The text here reminded me about it, but it would be
>> the text earlier at the incompatible feature bits that we edit if we
>> want that additional restriction; compare to the restriction in the
>> autoclear bit 1). After all, when raw external file is enabled, the
>> entire image is allocated, at which point subclusters don't make much
>> sense.
> It still may cache information about zeroed subclusters: gives more
> detailed block-status. But we should mention somehow external
> files. Hm. not only for raw external files, but it is documented that
> cluster can't be unallocated when an external data file is used.

What do you mean by "cluster can't be unallocated" ?

Berto
Vladimir Sementsov-Ogievskiy April 14, 2020, 4:19 p.m. UTC | #6
14.04.2020 17:50, Alberto Garcia wrote:
> On Fri 10 Apr 2020 11:29:59 AM CEST, Vladimir Sementsov-Ogievskiy wrote:
>>> Hmm - raw external files are incompatible with backing files. Should
>>> we also document that extended L2 entries are incompatible with raw
>>> external files? (The text here reminded me about it, but it would be
>>> the text earlier at the incompatible feature bits that we edit if we
>>> want that additional restriction; compare to the restriction in the
>>> autoclear bit 1). After all, when raw external file is enabled, the
>>> entire image is allocated, at which point subclusters don't make much
>>> sense.
>> It still may cache information about zeroed subclusters: gives more
>> detailed block-status. But we should mention somehow external
>> files. Hm. not only for raw external files, but it is documented that
>> cluster can't be unallocated when an external data file is used.
> 
> What do you mean by "cluster can't be unallocated" ?
> 


I mean this sentence from qcow2.txt:

                    "The offset may only be 0 with
                     bit 63 set (indicating a host cluster offset of 0) when an
                     external data file is used."

In other words, cluster can't be unallocated with data file in use.
Alberto Garcia April 14, 2020, 4:30 p.m. UTC | #7
On Tue 14 Apr 2020 06:19:18 PM CEST, Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com> wrote:
>>> It still may cache information about zeroed subclusters: gives more
>>> detailed block-status. But we should mention somehow external
>>> files. Hm. not only for raw external files, but it is documented that
>>> cluster can't be unallocated when an external data file is used.
>> 
>> What do you mean by "cluster can't be unallocated" ?
>
> I mean this sentence from qcow2.txt:
>
>                     "The offset may only be 0 with
>                      bit 63 set (indicating a host cluster offset of 0) when an
>                      external data file is used."
>
> In other words, cluster can't be unallocated with data file in use.

I still don't follow... clusters can be unallocated, and when you create
a new image they are indeed unallocated.

Bit 63 (QCOW_OFLAG_COPIED) is what indicates if a cluster is allocated
or not, and you can unmap an allocated cluster with 'write -z -u'.

Berto
Vladimir Sementsov-Ogievskiy April 14, 2020, 6:06 p.m. UTC | #8
14.04.2020 19:30, Alberto Garcia wrote:
> On Tue 14 Apr 2020 06:19:18 PM CEST, Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com> wrote:
>>>> It still may cache information about zeroed subclusters: gives more
>>>> detailed block-status. But we should mention somehow external
>>>> files. Hm. not only for raw external files, but it is documented that
>>>> cluster can't be unallocated when an external data file is used.
>>>
>>> What do you mean by "cluster can't be unallocated" ?
>>
>> I mean this sentence from qcow2.txt:
>>
>>                      "The offset may only be 0 with
>>                       bit 63 set (indicating a host cluster offset of 0) when an
>>                       external data file is used."
>>
>> In other words, cluster can't be unallocated with data file in use.
> 
> I still don't follow... clusters can be unallocated, and when you create
> a new image they are indeed unallocated.

with external data file? Then we probably need to fix the spec..

unallocated means that offset is 0, and bit 63 is unset. But this can't be when an external data file is used, according to the spec.

Or what am I missing?
Alberto Garcia April 14, 2020, 6:13 p.m. UTC | #9
On Tue 14 Apr 2020 08:06:38 PM CEST, Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com> wrote:
>>> In other words, cluster can't be unallocated with data file in use.
>> 
>> I still don't follow... clusters can be unallocated, and when you
>> create a new image they are indeed unallocated.
>
> with external data file? Then we probably need to fix the spec..
>
> unallocated means that offset is 0, and bit 63 is unset. But this can't
> be when an external data file is used, according to the spec.
>
> Or what am I missing?

   $ qemu-img create -f qcow2 -o data_file=data.raw img.qcow2 1M
   $ qemu-io -c 'write 0 192k' img.qcow2 
   $ qemu-io -c 'write -z -u 64k 64k' img.qcow2 

Clusters #0 and #2 are allocated (offsets 0x00000 and 0x20000), cluster
#1 is unallocated (offset 0, bit 63 unset, bit 0 -all zeroes- set).

Berto
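
For illustration, the three L2 entries described above can be decoded
with a minimal sketch like the following. It is not QEMU code: the
helper name decode_l2_entry and the constant names are made up for this
example, and the entry values are reconstructed from the description
(offsets 0x00000 and 0x20000 with bit 63 set, and a zero cluster with
only bit 0 set), not dumped from a real image. The bit layout assumed
here is the standard cluster descriptor from docs/interop/qcow2.txt
(bit 0: reads as zeros, bits 9-55: host cluster offset, bit 62:
compressed, bit 63: refcount is exactly one).

```python
# Hypothetical constants for a standard (64-bit) qcow2 L2 entry.
ZERO_FLAG = 1 << 0                                 # bit 0: reads as zeros
OFFSET_MASK = ((1 << 56) - 1) & ~((1 << 9) - 1)    # bits 9-55: host offset
COMPRESSED_FLAG = 1 << 62                          # bit 62: compressed
COPIED_FLAG = 1 << 63                              # bit 63: refcount == 1


def decode_l2_entry(entry):
    """Split a standard L2 entry into its documented fields."""
    return {
        "host_offset": entry & OFFSET_MASK,
        "reads_as_zeros": bool(entry & ZERO_FLAG),
        "compressed": bool(entry & COMPRESSED_FLAG),
        "copied": bool(entry & COPIED_FLAG),
    }


# Entry values assumed from the description above (illustrative only).
entries = [
    0x8000000000000000,  # cluster #0: offset 0x00000, bit 63 set
    0x0000000000000001,  # cluster #1: offset 0, bit 63 unset, bit 0 set
    0x8000000000020000,  # cluster #2: offset 0x20000, bit 63 set
]

for index, entry in enumerate(entries):
    print(index, decode_l2_entry(entry))
```

With an external data file, offset 0 is a valid host cluster offset, so
in this sketch only the flags distinguish cluster #0 from cluster #1.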
Alberto Garcia April 14, 2020, 6:16 p.m. UTC | #10
On Thu 09 Apr 2020 05:12:16 PM CEST, Eric Blake <eblake@redhat.com> wrote:
> Hmm - raw external files are incompatible with backing files.

Pre-existing, but I just realized that we are not checking that in
qcow2_do_open(), only on _create().

I suppose that if we find such an image we should either

   a) Show an error message and abort.
   b) Clear the 'raw data file' bit and proceed as if it was unset.

Berto
Eric Blake April 14, 2020, 6:23 p.m. UTC | #11
On 4/14/20 1:16 PM, Alberto Garcia wrote:
> On Thu 09 Apr 2020 05:12:16 PM CEST, Eric Blake <eblake@redhat.com> wrote:
>> Hmm - raw external files are incompatible with backing files.
> 
> Pre-existing, but I just realized that we are not checking that in
> qcow2_do_open(), only on _create().
> 
> I suppose that if we find such an image we should either
> 
>     a) Show an error message and abort.
>     b) Clear the 'raw data file' bit and proceed as if it was unset.

I would favor a).  Such an image was (hopefully) created externally, and 
not by qemu; therefore refusing to open it will call attention to the 
image (and its creation process) being broken, rather than risking 
silent corruption of whatever the external process thought it was 
accomplishing by creating an image like that.
Eric Blake April 14, 2020, 6:25 p.m. UTC | #12
On 4/14/20 1:23 PM, Eric Blake wrote:
> On 4/14/20 1:16 PM, Alberto Garcia wrote:
>> On Thu 09 Apr 2020 05:12:16 PM CEST, Eric Blake <eblake@redhat.com> 
>> wrote:
>>> Hmm - raw external files are incompatible with backing files.
>>
>> Pre-existing, but I just realized that we are not checking that in
>> qcow2_do_open(), only on _create().
>>
>> I suppose that if we find such an image we should either
>>
>>     a) Show an error message and abort.
>>     b) Clear the 'raw data file' bit and proceed as if it was unset.
> 
> I would favor a).  Such an image was (hopefully) created externally, and 
> not by qemu; therefore refusing to open it will call attention to the 
> image (and its creation process) being broken, rather than risking 
> silent corruption of whatever the external process thought it was 
> accomplishing by creating an image like that.

Also, 'qemu-img check' should flag the problem, and I'd be okay with 
'qemu-img check -r all' repairing the problem by method b) (because then 
the user is explicitly opting in to having qemu change the image in 
order to maximize the amount of data that qemu can then extract from the 
image).
Alberto Garcia April 15, 2020, 7:11 p.m. UTC | #13
On Fri 10 Apr 2020 11:29:59 AM CEST, Vladimir Sementsov-Ogievskiy wrote:
>> Should we also document that extended L2 entries are incompatible
>> with raw external files? [...] After all, when raw external file is
>> enabled, the entire image is allocated, at which point subclusters
>> don't make much sense.
>
> It still may cache information about zeroed subclusters: gives more
> detailed block-status.

So shall I forbid extended_l2 + data_file_raw then?

I wonder, if the only problem is that it's just not very useful, does it
make sense to add additional complexity and restrictions to the code
simply to prevent the user from making a sub-optimal choice?

Berto
Eric Blake April 15, 2020, 9:13 p.m. UTC | #14
On 4/15/20 2:11 PM, Alberto Garcia wrote:
> On Fri 10 Apr 2020 11:29:59 AM CEST, Vladimir Sementsov-Ogievskiy wrote:
>>> Should we also document that extended L2 entries are incompatible
>>> with raw external files? [...] After all, when raw external file is
>>> enabled, the entire image is allocated, at which point subclusters
>>> don't make much sense.
>>
>> It still may cache information about zeroed subclusters: gives more
>> detailed block-status.

That's a good point about one reason why it might be useful.

> 
> So shall I forbid extended_l2 + data_file_raw then?
> 
> I wonder, if the only problem is that it's just not very useful, does it
> make sense to add additional complexity and restrictions to the code
> simply to prevent the user from making a sub-optimal choice?

At this point, I'm not seeing a technical reason why we have to forbid 
subclusters with data-file-raw.  Mixing may be inefficient compared to 
using raw-data-file without subclusters, but inefficiencies are not 
worth the code bloat to forbid the combination.  If we come up with a 
scenario where the mix would cause data corruption, that's a different 
story, but I'm not seeing such a reason at the moment.
Patch

diff --git a/docs/interop/qcow2.txt b/docs/interop/qcow2.txt
index 5597e24474..2e8cad38c4 100644
--- a/docs/interop/qcow2.txt
+++ b/docs/interop/qcow2.txt
@@ -39,6 +39,9 @@  The first cluster of a qcow2 image contains the file header:
                     as the maximum cluster size and won't be able to open images
                     with larger cluster sizes.
 
+                    Note: if the image has Extended L2 Entries then cluster_bits
+                    must be at least 14 (i.e. 16384 byte clusters).
+
          24 - 31:   size
                     Virtual disk size in bytes.
 
@@ -114,7 +117,12 @@  the next fields through header_length.
                                 clusters. The compression_type field must be
                                 present and not zero.
 
-                    Bits 4-63:  Reserved (set to 0)
+                    Bit 4:      Extended L2 Entries.  If this bit is set then
+                                L2 table entries use an extended format that
+                                allows subcluster-based allocation. See the
+                                Extended L2 Entries section for more details.
+
+                    Bits 5-63:  Reserved (set to 0)
 
          80 -  87:  compatible_features
                     Bitmask of compatible features. An implementation can
@@ -493,7 +501,7 @@  cannot be relaxed without an incompatible layout change).
 Given an offset into the virtual disk, the offset into the image file can be
 obtained as follows:
 
-    l2_entries = (cluster_size / sizeof(uint64_t))
+    l2_entries = (cluster_size / sizeof(uint64_t))        [*]
 
     l2_index = (offset / cluster_size) % l2_entries
     l1_index = (offset / cluster_size) / l2_entries
@@ -503,6 +511,8 @@  obtained as follows:
 
     return cluster_offset + (offset % cluster_size)
 
+    [*] this changes if Extended L2 Entries are enabled, see next section
+
 L1 table entry:
 
     Bit  0 -  8:    Reserved (set to 0)
@@ -543,7 +553,8 @@  Standard Cluster Descriptor:
                     nor is data read from the backing file if the cluster is
                     unallocated.
 
-                    With version 2, this is always 0.
+                    With version 2 or with extended L2 entries (see the next
+                    section), this is always 0.
 
          1 -  8:    Reserved (set to 0)
 
@@ -580,6 +591,57 @@  file (except if bit 0 in the Standard Cluster Descriptor is set). If there is
 no backing file or the backing file is smaller than the image, they shall read
 zeros for all parts that are not covered by the backing file.
 
+== Extended L2 Entries ==
+
+An image uses Extended L2 Entries if bit 4 is set on the incompatible_features
+field of the header.
+
+In these images standard data clusters are divided into 32 subclusters of the
+same size. They are contiguous and start from the beginning of the cluster.
+Subclusters can be allocated independently and the L2 entry contains information
+indicating the status of each one of them. Compressed data clusters don't have
+subclusters so they are treated the same as in images without this feature.
+
+The size of an extended L2 entry is 128 bits so the number of entries per table
+is calculated using this formula:
+
+    l2_entries = (cluster_size / (2 * sizeof(uint64_t)))
+
+The first 64 bits have the same format as the standard L2 table entry described
+in the previous section, with the exception of bit 0 of the standard cluster
+descriptor.
+
+The last 64 bits contain a subcluster allocation bitmap with this format:
+
+Subcluster Allocation Bitmap (for standard clusters):
+
+    Bit  0 -  31:   Allocation status (one bit per subcluster)
+
+                    1: the subcluster is allocated. In this case the
+                       host cluster offset field must contain a valid
+                       offset.
+                    0: the subcluster is not allocated. In this case
+                       read requests shall go to the backing file or
+                       return zeros if there is no backing file data.
+
+                    Bits are assigned starting from the least significant
+                    one (i.e. bit x is used for subcluster x).
+
+        32 -  63    Subcluster reads as zeros (one bit per subcluster)
+
+                    1: the subcluster reads as zeros. In this case the
+                       allocation status bit must be unset. The host
+                       cluster offset field may or may not be set.
+                    0: no effect.
+
+                    Bits are assigned starting from the least significant
+                    one (i.e. bit x is used for subcluster x - 32).
+
+Subcluster Allocation Bitmap (for compressed clusters):
+
+    Bit  0 -  63:   Reserved (set to 0)
+                    Compressed clusters don't have subclusters,
+                    so this field is not used.
 
 == Snapshots ==
 
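
The Extended L2 Entries section added above can be read as the
following minimal sketch. It is only an illustration of the documented
layout, not the QEMU implementation: the function names are invented
for this example, and treating "both bits set" as invalid is an
interpretation of the rule that the allocation bit must be unset when
the reads-as-zeros bit is set.

```python
SUBCLUSTERS_PER_CLUSTER = 32   # fixed by the format described above


def l2_entries(cluster_size, extended_l2):
    """Entries per L2 table: 8-byte entries, or 16-byte extended ones."""
    entry_size = 16 if extended_l2 else 8
    return cluster_size // entry_size


def subcluster_index(offset, cluster_size):
    """Index (0-31) of the subcluster that holds a guest offset."""
    subcluster_size = cluster_size // SUBCLUSTERS_PER_CLUSTER
    return (offset % cluster_size) // subcluster_size


def subcluster_status(bitmap, index):
    """Classify one subcluster from the 64-bit allocation bitmap."""
    allocated = (bitmap >> index) & 1          # bits 0-31
    zero = (bitmap >> (32 + index)) & 1        # bits 32-63
    if allocated and zero:
        return "invalid"       # the text requires the allocation bit unset
    if zero:
        return "zero"          # reads as zeros
    if allocated:
        return "allocated"     # data is at the host cluster offset
    return "unallocated"       # read from backing file, or zeros


# Example: 64 KiB clusters give 2 KiB subclusters.
print(l2_entries(65536, extended_l2=True))
print(subcluster_index(3 * 65536 + 5 * 2048 + 10, 65536))
print(subcluster_status((1 << 32) | 0b10, 0))
```

For compressed clusters the bitmap is reserved and must be 0, so a
reader following this sketch would skip the bitmap for those entries.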
diff --git a/docs/qcow2-cache.txt b/docs/qcow2-cache.txt
index d57f409861..5f763aa6bb 100644
--- a/docs/qcow2-cache.txt
+++ b/docs/qcow2-cache.txt
@@ -1,6 +1,6 @@ 
 qcow2 L2/refcount cache configuration
 =====================================
-Copyright (C) 2015, 2018 Igalia, S.L.
+Copyright (C) 2015, 2018-2020 Igalia, S.L.
 Author: Alberto Garcia <berto@igalia.com>
 
 This work is licensed under the terms of the GNU GPL, version 2 or
@@ -222,3 +222,20 @@  support this functionality, and is 0 (disabled) on other platforms.
 This functionality currently relies on the MADV_DONTNEED argument for
 madvise() to actually free the memory. This is a Linux-specific feature,
 so cache-clean-interval is not supported on other systems.
+
+
+Extended L2 Entries
+-------------------
+All numbers shown in this document are valid for qcow2 images with normal
+64-bit L2 entries.
+
+Images with extended L2 entries need twice as much L2 metadata, so the L2
+cache size must be twice as large for the same disk space.
+
+   disk_size = l2_cache_size * cluster_size / 16
+
+i.e.
+
+   l2_cache_size = disk_size * 16 / cluster_size
+
+Refcount blocks are not affected by this.
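
The two formulas above can be checked with a short sketch. This is a
hedged illustration rather than anything QEMU ships: the function names
are hypothetical, the divisor 8 for standard entries comes from the
existing text of qcow2-cache.txt, and the result is rounded up to whole
bytes only.

```python
def l2_cache_size(disk_size, cluster_size, extended_l2=False):
    """Bytes of L2 cache needed to map disk_size bytes of guest data.

    Each cluster needs one L2 entry: 8 bytes normally, 16 bytes with
    extended L2 entries.
    """
    entry_size = 16 if extended_l2 else 8
    needed = disk_size * entry_size
    return (needed + cluster_size - 1) // cluster_size   # round up


def covered_disk_size(cache_size, cluster_size, extended_l2=False):
    """Guest bytes that an L2 cache of cache_size bytes can map."""
    entry_size = 16 if extended_l2 else 8
    return cache_size * cluster_size // entry_size


TiB = 1 << 40
MiB = 1 << 20

# A 1 TiB image with 64 KiB clusters.
print(l2_cache_size(TiB, 65536) // MiB)                    # 128
print(l2_cache_size(TiB, 65536, extended_l2=True) // MiB)  # 256
```

As the text says, the same disk needs twice the L2 cache once extended
L2 entries are enabled; the refcount cache calculation is unchanged.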