[1/1] qcow2: handle cluster leak happening with a guest TRIM command

Message ID 1495376506-13227-1-git-send-email-den@openvz.org
State New

Commit Message

Denis V. Lunev May 21, 2017, 2:21 p.m. UTC
  qemu-img create -f qcow2 1.img 64G
  qemu-io -c "write -P 0x32 0 64k" 1.img
results in
  324 -rw-r--r-- 1 den den 393216 May 21 16:48 1.img
Subsequent
  qemu-io -c "write -z 0 64k" 1.img
  qemu-io -c "write -P 0x32 0 64k" 1.img
results in
  388 -rw-r--r-- 1 den den 458752 May 21 16:50 1.img
which looks like we have 1 cluster leaked.

Indeed, qcow2_co_pwrite_zeroes calls qcow2_zero_clusters/zero_single_l2,
which does not update the refcount for the host cluster and keeps the
offset marked as used. Later on, handle_copied() does not take the
QCOW2_CLUSTER_ZERO cluster type into account.

As a result, we end up in a very bad situation after qcow2_zero_clusters.
We have a hole in the host file, yet the offset is still allocated for
that guest cluster. Now there are 2 options:
1) allocate a new offset once a write comes into this guest cluster
   (this is what actually happens, but the original cluster offset is
   leaked)
2) reuse the host offset, i.e. fix handle_copied() to allow reusing the
   offset not only for QCOW2_CLUSTER_NORMAL but for QCOW2_CLUSTER_ZERO
   too
Option 2) seems worse to me, as in this case we can easily end up with
host fragmentation in that cluster if writes come in small pieces.

This is not a very big deal on a filesystem with PUNCH_HOLE support,
but without this feature the cluster is actually leaked forever.

The patch replaces zero_single_l2 with discard_single_l2, and removes
the now-unused zero_single_l2, to fix the situation.

Signed-off-by: Denis V. Lunev <den@openvz.org>
CC: Kevin Wolf <kwolf@redhat.com>
CC: Max Reitz <mreitz@redhat.com>
---
 block/qcow2-cluster.c | 46 ++--------------------------------------------
 1 file changed, 2 insertions(+), 44 deletions(-)

Comments

Eric Blake May 22, 2017, 11:37 a.m. UTC | #1
On 05/21/2017 09:21 AM, Denis V. Lunev wrote:
>   qemu-img create -f qcow2 1.img 64G
>   qemu-io -c "write -P 0x32 0 64k" 1.img
> results in
>   324 -rw-r--r-- 1 den den 393216 May 21 16:48 1.img
> Subsequent
>   qemu-io -c "write -z 0 64k" 1.img
>   qemu-io -c "write -P 0x32 0 64k" 1.img
> results in
>   388 -rw-r--r-- 1 den den 458752 May 21 16:50 1.img
> which looks like we have 1 cluster leaked.
> 
> Indeed, qcow2_co_pwrite_zeroes calls qcow2_zero_clusters/zero_single_l2,
> which does not update the refcount for the host cluster and keeps the
> offset marked as used. Later on, handle_copied() does not take the
> QCOW2_CLUSTER_ZERO cluster type into account.

NACK.  We've already fixed this upstream, at commit 564a6b693.

$ ./qemu-img create -f qcow2 1.img 64g
Formatting '1.img', fmt=qcow2 size=68719476736 encryption=off
cluster_size=65536 lazy_refcounts=off refcount_bits=16
$ ./qemu-io -c 'w -P 0x32 0 64k' 1.img
wrote 65536/65536 bytes at offset 0
64 KiB, 1 ops; 0.0410 sec (1.524 MiB/sec and 24.3902 ops/sec)
$ ls -l 1.img
-rw-r--r--. 1 eblake eblake 393216 May 22 06:35 1.img
$ ./qemu-io -c 'w -z 0 64k' 1.img
wrote 65536/65536 bytes at offset 0
64 KiB, 1 ops; 0.0055 sec (11.345 MiB/sec and 181.5211 ops/sec)
$ ./qemu-io -c 'w -P 0x32 0 64k' 1.img
wrote 65536/65536 bytes at offset 0
64 KiB, 1 ops; 0.0160 sec (3.903 MiB/sec and 62.4415 ops/sec)
$ ls -l 1.img
-rw-r--r--. 1 eblake eblake 393216 May 22 06:35 1.img


As it is, your patch doesn't apply to master.  And even if it did, your
patch breaks semantics - we CANNOT discard clusters that must read back
as zeroes.
Denis V. Lunev May 22, 2017, 11:56 a.m. UTC | #2
On 05/22/2017 02:37 PM, Eric Blake wrote:
> On 05/21/2017 09:21 AM, Denis V. Lunev wrote:
>>   qemu-img create -f qcow2 1.img 64G
>>   qemu-io -c "write -P 0x32 0 64k" 1.img
>> results in
>>   324 -rw-r--r-- 1 den den 393216 May 21 16:48 1.img
>> Subsequent
>>   qemu-io -c "write -z 0 64k" 1.img
>>   qemu-io -c "write -P 0x32 0 64k" 1.img
>> results in
>>   388 -rw-r--r-- 1 den den 458752 May 21 16:50 1.img
>> which looks like we have 1 cluster leaked.
>>
>> Indeed, qcow2_co_pwrite_zeroes calls qcow2_zero_clusters/zero_single_l2,
>> which does not update the refcount for the host cluster and keeps the
>> offset marked as used. Later on, handle_copied() does not take the
>> QCOW2_CLUSTER_ZERO cluster type into account.
> NACK.  We've already fixed this upstream, at commit 564a6b693.
My bad. I am updating sources too infrequently :(

> $ ./qemu-img create -f qcow2 1.img 64g
> Formatting '1.img', fmt=qcow2 size=68719476736 encryption=off
> cluster_size=65536 lazy_refcounts=off refcount_bits=16
> $ ./qemu-io -c 'w -P 0x32 0 64k' 1.img
> wrote 65536/65536 bytes at offset 0
> 64 KiB, 1 ops; 0.0410 sec (1.524 MiB/sec and 24.3902 ops/sec)
> $ ls -l 1.img
> -rw-r--r--. 1 eblake eblake 393216 May 22 06:35 1.img
> $ ./qemu-io -c 'w -z 0 64k' 1.img
> wrote 65536/65536 bytes at offset 0
> 64 KiB, 1 ops; 0.0055 sec (11.345 MiB/sec and 181.5211 ops/sec)
> $ ./qemu-io -c 'w -P 0x32 0 64k' 1.img
> wrote 65536/65536 bytes at offset 0
> 64 KiB, 1 ops; 0.0160 sec (3.903 MiB/sec and 62.4415 ops/sec)
> $ ls -l 1.img
> -rw-r--r--. 1 eblake eblake 393216 May 22 06:35 1.img
>
>
> As it is, your patch doesn't apply to master.  And even if it did, your
> patch breaks semantics - we CANNOT discard clusters that must read back
> as zeroes.
>
Can you please give some details on why the cluster cannot be
discarded? This is something I cannot understand.

In my opinion, the cluster is clearly marked with QCOW_OFLAG_ZERO
and thus will read back as zeroes regardless of the content of the
cluster pointed to by the offset. This is actually what happens
on a FS without PUNCH_HOLE support.

That is why this offset can actually be reused for any other
block, which is what I am trying to propose.

Den
Eric Blake May 22, 2017, 1:02 p.m. UTC | #3
On 05/22/2017 06:56 AM, Denis V. Lunev wrote:
>>
>>
>> As it is, your patch doesn't apply to master.  And even if it did, your
>> patch breaks semantics - we CANNOT discard clusters that must read back
>> as zeroes.
>>
> Can you please give some details on why the cluster cannot be
> discarded? This is something I cannot understand.

The cluster CAN be discarded if the guest requests it.  There is a
difference between:

qemu-io -c 'w -z    0 64k' 1.img

which says to ensure that the cluster reads back as zero, but to NOT
unmap things (corresponding to NBD_CMD_WRITE_ZEROES with
NBD_CMD_FLAG_NO_HOLE set), and:

qemu-io -c 'w -z -u 0 64k' 1.img

which says to ensure that the cluster reads back as zero, and that
unmapping the cluster is permitted if that will make the operation
faster and/or use less space (corresponding to NBD_CMD_WRITE_ZEROES with
NBD_CMD_FLAG_NO_HOLE clear).

In current master, we've gone as far as to make an obvious difference
between the two types of qcow2 zero clusters (see commit fdfab37);
QCOW2_CLUSTER_ZERO_PLAIN corresponds to the case where the guest allowed
a cluster to be unmapped, while QCOW2_CLUSTER_ZERO_ALLOC corresponds to
the case where we've written zeroes but require the allocation mapping
to remain intact.

One of the big reasons why we do NOT want the guest to be able to
convert a cluster to QCOW2_CLUSTER_UNALLOCATED after a write zero
request (but only after a discard request) is that we don't want to
audit what will happen if a backing file is later added to the image.
If a guest does a discard, then they are acknowledging that the data in
that cluster can read back as anything (we try to make it read back as
zero where practical, but do not guarantee that).  But if a guest does a
write zero request, even if they permit unmapping, the semantics require
that they can read back zero, no matter what happens to the backing chain.

An example that Kevin used to convince me was the process of drive
mirroring: if you mirror just the active qcow2 layer, then cancel the
mirror, the destination is supposed to be a snapshot of what the guest
saw at that moment, once you rebase the destination image to be on top
of the same backing contents as the source saw.  That implies that I can
use 'qemu-img rebase -u' to inject a backing chain.  If we mirror a zero
cluster over to the destination, and make sure the destination sees a
QCOW2_CLUSTER_ZERO_PLAIN, then no matter what the backing chain we later
rebase the destination onto, we still read 0 for that cluster (as the
guest expected).  But if we mirror the zero cluster over and leave the
destination image with QCOW2_CLUSTER_UNALLOCATED, then supply a backing
chain where the cluster does not read as zero, then we have altered the
guest-visible contents.  So it is safer to ALWAYS require that a guest
write-zero action uses one of the two types of qcow2 zero clusters, to
keep the guest-visible contents of that cluster constant no matter what
rebasing actions later occur.

That said, we've also had a conversation on things we can do to improve
mirroring; one of the ideas was to add a new flag (comparable to the
existing BDRV_REQ_MAY_UNMAP) that mirroring (but not guest actions) can
use to request that zero_single_l2() is permitted to reuse
QCOW2_CLUSTER_UNALLOCATED instead of having to switch to
QCOW2_CLUSTER_ZERO_PLAIN, or where we can exploit PUNCH_HOLE semantics
on the underlying bs->file to efficiently write zeroes into the file
system even while keeping QCOW2_CLUSTER_ZERO_ALLOC at the qcow2 layer.

> 
> In my opinion, the cluster is clearly marked with QCOW_OFLAG_ZERO
> and thus will read back as zeroes regardless of the content of the
> cluster pointed to by the offset. This is actually what happens
> on a FS without PUNCH_HOLE support.
> 
> That is why this offset can actually be reused for any other
> block, which is what I am trying to propose.

Yes, when using write zeroes with unmap (qemu-io -c 'w -z -u ...'), then
the code already unmaps the cluster, so that it can be reused for any
other block.  So it boils down to whether your guest is really
requesting write zeroes with permission to unmap, and whether any other
options you have specified regarding the disk are filtering out
BDRV_REQ_MAY_UNMAP before it reaches the qcow2 layer.
Denis V. Lunev May 23, 2017, 2:52 p.m. UTC | #4
On 05/22/2017 04:02 PM, Eric Blake wrote:
> On 05/22/2017 06:56 AM, Denis V. Lunev wrote:
>>>
>>> As it is, your patch doesn't apply to master.  And even if it did, your
>>> patch breaks semantics - we CANNOT discard clusters that must read back
>>> as zeroes.
>>>
>> Can you please give some details on why the cluster cannot be
>> discarded? This is something I cannot understand.
> The cluster CAN be discarded if the guest requests it.  There is a
> difference between:
>
> qemu-io -c 'w -z    0 64k' 1.img
>
> which says to ensure that the cluster reads back as zero, but to NOT
> unmap things (corresponding to NBD_CMD_WRITE_ZEROES with
> NBD_CMD_FLAG_NO_HOLE set), and:
>
> qemu-io -c 'w -z -u 0 64k' 1.img
>
> which says to ensure that the cluster reads back as zero, and that
> unmapping the cluster is permitted if that will make the operation
> faster and/or use less space (corresponding to NBD_CMD_WRITE_ZEROES with
> NBD_CMD_FLAG_NO_HOLE clear).
>
> In current master, we've gone as far as to make an obvious difference
> between the two types of qcow2 zero clusters (see commit fdfab37);
> QCOW2_CLUSTER_ZERO_PLAIN corresponds to the case where the guest allowed
> a cluster to be unmapped, while QCOW2_CLUSTER_ZERO_ALLOC corresponds to
> the case where we've written zeroes but require the allocation mapping
> to remain intact.
>
> One of the big reasons why we do NOT want the guest to be able to
> convert a cluster to QCOW2_CLUSTER_UNALLOCATED after a write zero
> request (but only after a discard request) is that we don't want to
> audit what will happen if a backing file is later added to the image.
> If a guest does a discard, then they are acknowledging that the data in
> that cluster can read back as anything (we try to make it read back as
> zero where practical, but do not guarantee that).  But if a guest does a
> write zero request, even if they permit unmapping, the semantics require
> that they can read back zero, no matter what happens to the backing chain.
>
> An example that Kevin used to convince me was the process of drive
> mirroring: if you mirror just the active qcow2 layer, then cancel the
> mirror, the destination is supposed to be a snapshot of what the guest
> saw at that moment, once you rebase the destination image to be on top
> of the same backing contents as the source saw.  That implies that I can
> use 'qemu-img rebase -u' to inject a backing chain.  If we mirror a zero
> cluster over to the destination, and make sure the destination sees a
> QCOW2_CLUSTER_ZERO_PLAIN, then no matter what the backing chain we later
> rebase the destination onto, we still read 0 for that cluster (as the
> guest expected).  But if we mirror the zero cluster over and leave the
> destination image with QCOW2_CLUSTER_UNALLOCATED, then supply a backing
> chain where the cluster does not read as zero, then we have altered the
> guest-visible contents.  So it is safer to ALWAYS require that a guest
> write-zero action uses one of the two types of qcow2 zero clusters, to
> keep the guest-visible contents of that cluster constant no matter what
> rebasing actions later occur.
>
> That said, we've also had a conversation on things we can do to improve
> mirroring; one of the ideas was to add a new flag (comparable to the
> existing BDRV_REQ_MAY_UNMAP) that mirroring (but not guest actions) can
> use to request that zero_single_l2() is permitted to reuse
> QCOW2_CLUSTER_UNALLOCATED instead of having to switch to
> QCOW2_CLUSTER_ZERO_PLAIN, or where we can exploit PUNCH_HOLE semantics
> on the underlying bs->file to efficiently write zeroes into the file
> system even while keeping QCOW2_CLUSTER_ZERO_ALLOC at the qcow2 layer.
>
>> In my opinion, the cluster is clearly marked with QCOW_OFLAG_ZERO
>> and thus will read back as zeroes regardless of the content of the
>> cluster pointed to by the offset. This is actually what happens
>> on a FS without PUNCH_HOLE support.
>>
>> That is why this offset can actually be reused for any other
>> block, which is what I am trying to propose.
> Yes, when using write zeroes with unmap (qemu-io -c 'w -z -u ...'), then
> the code already unmaps the cluster, so that it can be reused for any
> other block.  So it boils down to whether your guest is really
> requesting write zeroes with permission to unmap, and whether any other
> options you have specified regarding the disk are filtering out
> BDRV_REQ_MAY_UNMAP before it reaches the qcow2 layer.
>
This is very clear now. Thank you very much for this explanation.

Thank you and best regards,
    Den

Patch

diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c
index 100398c..1e53a7c 100644
--- a/block/qcow2-cluster.c
+++ b/block/qcow2-cluster.c
@@ -1548,49 +1548,6 @@  fail:
     return ret;
 }
 
-/*
- * This zeroes as many clusters of nb_clusters as possible at once (i.e.
- * all clusters in the same L2 table) and returns the number of zeroed
- * clusters.
- */
-static int zero_single_l2(BlockDriverState *bs, uint64_t offset,
-                          uint64_t nb_clusters, int flags)
-{
-    BDRVQcow2State *s = bs->opaque;
-    uint64_t *l2_table;
-    int l2_index;
-    int ret;
-    int i;
-
-    ret = get_cluster_table(bs, offset, &l2_table, &l2_index);
-    if (ret < 0) {
-        return ret;
-    }
-
-    /* Limit nb_clusters to one L2 table */
-    nb_clusters = MIN(nb_clusters, s->l2_size - l2_index);
-    assert(nb_clusters <= INT_MAX);
-
-    for (i = 0; i < nb_clusters; i++) {
-        uint64_t old_offset;
-
-        old_offset = be64_to_cpu(l2_table[l2_index + i]);
-
-        /* Update L2 entries */
-        qcow2_cache_entry_mark_dirty(bs, s->l2_table_cache, l2_table);
-        if (old_offset & QCOW_OFLAG_COMPRESSED || flags & BDRV_REQ_MAY_UNMAP) {
-            l2_table[l2_index + i] = cpu_to_be64(QCOW_OFLAG_ZERO);
-            qcow2_free_any_clusters(bs, old_offset, 1, QCOW2_DISCARD_REQUEST);
-        } else {
-            l2_table[l2_index + i] |= cpu_to_be64(QCOW_OFLAG_ZERO);
-        }
-    }
-
-    qcow2_cache_put(bs, s->l2_table_cache, (void **) &l2_table);
-
-    return nb_clusters;
-}
-
 int qcow2_zero_clusters(BlockDriverState *bs, uint64_t offset, int nb_sectors,
                         int flags)
 {
@@ -1609,7 +1566,8 @@  int qcow2_zero_clusters(BlockDriverState *bs, uint64_t offset, int nb_sectors,
     s->cache_discards = true;
 
     while (nb_clusters > 0) {
-        ret = zero_single_l2(bs, offset, nb_clusters, flags);
+        ret = discard_single_l2(bs, offset, nb_clusters,
+                                QCOW2_DISCARD_REQUEST, false);
         if (ret < 0) {
             goto fail;
         }