
[v7,28/32] qcow2: Add subcluster support to qcow2_co_pwrite_zeroes()

Message ID e037ed54599e7bf4d76bf8cd8db1904a20ffc6dd.1590429901.git.berto@igalia.com (mailing list archive)
State New, archived
Series Add subcluster allocation to qcow2

Commit Message

Alberto Garcia May 25, 2020, 6:08 p.m. UTC
This works now at the subcluster level and pwrite_zeroes_alignment is
updated accordingly.

qcow2_cluster_zeroize() is turned into qcow2_subcluster_zeroize() with
the following changes:

   - The request can now be subcluster-aligned.

   - The cluster-aligned body of the request is still zeroized using
     zero_in_l2_slice() as before.

   - The subcluster-aligned head and tail of the request are zeroized
     with the new zero_l2_subclusters() function.

There is just one thing to take into account for a possible future
improvement: compressed clusters cannot be partially zeroized so
zero_l2_subclusters() on the head or the tail can return -ENOTSUP.
This makes the caller repeat the *complete* request and write actual
zeroes to disk. This is sub-optimal because

   1) if the head area was compressed we would still be able to use
      the fast path for the body and possibly the tail.

   2) if the tail area was compressed we are writing zeroes to the
      head and the body areas, which are already zeroized.

Signed-off-by: Alberto Garcia <berto@igalia.com>
---
 block/qcow2.h         |  4 +--
 block/qcow2-cluster.c | 80 +++++++++++++++++++++++++++++++++++++++----
 block/qcow2.c         | 27 ++++++++-------
 3 files changed, 90 insertions(+), 21 deletions(-)

Comments

Eric Blake May 27, 2020, 5:58 p.m. UTC | #1
On 5/25/20 1:08 PM, Alberto Garcia wrote:
> This works now at the subcluster level and pwrite_zeroes_alignment is
> updated accordingly.
> 
> qcow2_cluster_zeroize() is turned into qcow2_subcluster_zeroize() with
> the following changes:
> 
>     - The request can now be subcluster-aligned.
> 
>     - The cluster-aligned body of the request is still zeroized using
>       zero_in_l2_slice() as before.
> 
>     - The subcluster-aligned head and tail of the request are zeroized
>       with the new zero_l2_subclusters() function.
> 
> There is just one thing to take into account for a possible future
> improvement: compressed clusters cannot be partially zeroized so
> zero_l2_subclusters() on the head or the tail can return -ENOTSUP.
> This makes the caller repeat the *complete* request and write actual
> zeroes to disk. This is sub-optimal because
> 
>     1) if the head area was compressed we would still be able to use
>        the fast path for the body and possibly the tail.
> 
>     2) if the tail area was compressed we are writing zeroes to the
>        head and the body areas, which are already zeroized.

Is this true?  The block layer tries hard to break zero requests up so 
that any non-cluster-aligned requests do not cross cluster boundaries. 
In practice, that means that when you have an unaligned request, the 
head and tail cluster will be the same cluster, and there is no body in 
play, so that returning -ENOTSUP is correct because there really is no 
other work to do and repeating the entire request (which is less than a 
cluster in length) is the right approach.

> 
> Signed-off-by: Alberto Garcia <berto@igalia.com>
> ---
>   block/qcow2.h         |  4 +--
>   block/qcow2-cluster.c | 80 +++++++++++++++++++++++++++++++++++++++----
>   block/qcow2.c         | 27 ++++++++-------
>   3 files changed, 90 insertions(+), 21 deletions(-)

Reviewed-by: Eric Blake <eblake@redhat.com>
Alberto Garcia May 28, 2020, 3:04 p.m. UTC | #2
On Wed 27 May 2020 07:58:10 PM CEST, Eric Blake wrote:
>> There is just one thing to take into account for a possible future
>> improvement: compressed clusters cannot be partially zeroized so
>> zero_l2_subclusters() on the head or the tail can return -ENOTSUP.
>> This makes the caller repeat the *complete* request and write actual
>> zeroes to disk. This is sub-optimal because
>> 
>>     1) if the head area was compressed we would still be able to use
>>        the fast path for the body and possibly the tail.
>> 
>>     2) if the tail area was compressed we are writing zeroes to the
>>        head and the body areas, which are already zeroized.
>
> Is this true?  The block layer tries hard to break zero requests up so 
> that any non-cluster-aligned requests do not cross cluster boundaries. 
> In practice, that means that when you have an unaligned request, the 
> head and tail cluster will be the same cluster, and there is no body in 
> play, so that returning -ENOTSUP is correct because there really is no 
> other work to do and repeating the entire request (which is less than a 
> cluster in length) is the right approach.

Let's use an example.

cluster size is 64KB, subcluster size is 2KB, and we get this request:

   write -z 31k 130k

Since pwrite_zeroes_alignment equals the cluster size (64KB), this
would result in 3 calls to qcow2_co_pwrite_zeroes():

   offset=31k  size=33k    [-ENOTSUP, writes actual zeros]
   offset=64k  size=64k    [zeroized using the relevant metadata bits]
   offset=128k size=33k    [-ENOTSUP, writes actual zeros]

However this patch changes the alignment:

    bs->bl.pwrite_zeroes_alignment = s->subcluster_size;

so we get these instead:

   offset=31k  size=1k     [-ENOTSUP, writes actual zeros]
   offset=32k  size=128k   [zeroized using the relevant metadata bits]
   offset=160k size=1k     [-ENOTSUP, writes actual zeros]

So far, so good. Reducing the alignment requirements allows us to
maximize the number of subclusters to zeroize.

Now let's suppose we have this request:

   write -z 32k 128k

This one is aligned so it goes directly to qcow2_co_pwrite_zeroes().
However if the third cluster is compressed then the function will
return -ENOTSUP after having zeroized the first 96KB of the request,
forcing the caller to repeat it completely using the slow path.

I think the problem also exists in the current code (without my
patches). If you zeroize 10 clusters and the last one is compressed
you have to repeat the request after having zeroized 9 clusters.

Berto
Eric Blake May 28, 2020, 7:11 p.m. UTC | #3
On 5/28/20 10:04 AM, Alberto Garcia wrote:
> On Wed 27 May 2020 07:58:10 PM CEST, Eric Blake wrote:
>>> There is just one thing to take into account for a possible future
>>> improvement: compressed clusters cannot be partially zeroized so
>>> zero_l2_subclusters() on the head or the tail can return -ENOTSUP.
>>> This makes the caller repeat the *complete* request and write actual
>>> zeroes to disk. This is sub-optimal because
>>>
>>>      1) if the head area was compressed we would still be able to use
>>>         the fast path for the body and possibly the tail.
>>>
>>>      2) if the tail area was compressed we are writing zeroes to the
>>>         head and the body areas, which are already zeroized.
>>
>> Is this true?  The block layer tries hard to break zero requests up so
>> that any non-cluster-aligned requests do not cross cluster boundaries.
>> In practice, that means that when you have an unaligned request, the
>> head and tail cluster will be the same cluster, and there is no body in
>> play, so that returning -ENOTSUP is correct because there really is no
>> other work to do and repeating the entire request (which is less than a
>> cluster in length) is the right approach.
> 
> Let's use an example.
> 
> cluster size is 64KB, subcluster size is 2KB, and we get this request:
> 
>     write -z 31k 130k
> 
> Since pwrite_zeroes_alignment equals the cluster size (64KB), this
> would result in 3 calls to qcow2_co_pwrite_zeroes():
> 
>     offset=31k  size=33k    [-ENOTSUP, writes actual zeros]
>     offset=64k  size=64k    [zeroized using the relevant metadata bits]
>     offset=128k size=33k    [-ENOTSUP, writes actual zeros]
> 
> However this patch changes the alignment:
> 
>      bs->bl.pwrite_zeroes_alignment = s->subcluster_size;

Ah, I missed that trick.  But it is nice, and indeed...

> 
> so we get these instead:
> 
>     offset=31k  size=1k     [-ENOTSUP, writes actual zeros]
>     offset=32k  size=128k   [zeroized using the relevant metadata bits]
>     offset=160k size=1k     [-ENOTSUP, writes actual zeros]
> 
> So far, so good. Reducing the alignment requirements allows us to
> maximize the number of subclusters to zeroize.

...we can now hit a request that is not cluster-aligned.

> 
> Now let's suppose we have this request:
> 
>     write -z 32k 128k
> 
> This one is aligned so it goes directly to qcow2_co_pwrite_zeroes().
> However if the third cluster is compressed then the function will
> return -ENOTSUP after having zeroized the first 96KB of the request,
> forcing the caller to repeat it completely using the slow path.
> 
> I think the problem also exists in the current code (without my
> patches). If you zeroize 10 clusters and the last one is compressed
> you have to repeat the request after having zeroized 9 clusters.

Hmm. In the pre-patch code, qcow2_co_pwrite_zeroes() calls 
qcow2_cluster_zeroize() which can fail with -ENOTSUP up front, but not 
after the fact.  Once it starts the while loop over clusters, its use of 
zero_in_l2_slice() handles compressed clusters just fine; as far as I 
can tell, only your new subcluster handling lets it now fail with 
-ENOTSUP after earlier clusters have been visited.

But isn't this something we could solve recursively?  Instead of 
returning -ENOTSUP, we could have zero_in_l2_slice() call 
bdrv_pwrite_zeroes() on the (sub-)clusters associated with a compressed 
cluster.
Alberto Garcia May 29, 2020, 4:06 p.m. UTC | #4
On Thu 28 May 2020 09:11:07 PM CEST, Eric Blake wrote:
>> I think the problem also exists in the current code (without my
>> patches). If you zeroize 10 clusters and the last one is compressed
>> you have to repeat the request after having zeroized 9 clusters.
>
> Hmm. In the pre-patch code, qcow2_co_pwrite_zeroes() calls
> qcow2_cluster_zeroize() which can fail with -ENOTSUP up front, but not
> after the fact.  Once it starts the while loop over clusters, its use
> of zero_in_l2_slice() handles compressed clusters just fine;

You're right, complete compressed clusters can always be handled; the
problem only arises when there are subclusters involved.

> But isn't this something we could solve recursively?  Instead of
> returning -ENOTSUP, we could have zero_in_l2_slice() call
> bdrv_pwrite_zeroes() on the (sub-)clusters associated with a
> compressed cluster.

I suppose we could, as long as BDRV_REQ_NO_FALLBACK is not used.

Berto

Patch

diff --git a/block/qcow2.h b/block/qcow2.h
index 32c68ead9a..ece5f1cb5a 100644
--- a/block/qcow2.h
+++ b/block/qcow2.h
@@ -898,8 +898,8 @@  void qcow2_alloc_cluster_abort(BlockDriverState *bs, QCowL2Meta *m);
 int qcow2_cluster_discard(BlockDriverState *bs, uint64_t offset,
                           uint64_t bytes, enum qcow2_discard_type type,
                           bool full_discard);
-int qcow2_cluster_zeroize(BlockDriverState *bs, uint64_t offset,
-                          uint64_t bytes, int flags);
+int qcow2_subcluster_zeroize(BlockDriverState *bs, uint64_t offset,
+                             uint64_t bytes, int flags);
 
 int qcow2_expand_zero_clusters(BlockDriverState *bs,
                                BlockDriverAmendStatusCB *status_cb,
diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c
index deff838fe8..1641976028 100644
--- a/block/qcow2-cluster.c
+++ b/block/qcow2-cluster.c
@@ -2015,12 +2015,58 @@  static int zero_in_l2_slice(BlockDriverState *bs, uint64_t offset,
     return nb_clusters;
 }
 
-int qcow2_cluster_zeroize(BlockDriverState *bs, uint64_t offset,
-                          uint64_t bytes, int flags)
+static int zero_l2_subclusters(BlockDriverState *bs, uint64_t offset,
+                               unsigned nb_subclusters)
+{
+    BDRVQcow2State *s = bs->opaque;
+    uint64_t *l2_slice;
+    uint64_t old_l2_bitmap, l2_bitmap;
+    int l2_index, ret, sc = offset_to_sc_index(s, offset);
+
+    /* For full clusters use zero_in_l2_slice() instead */
+    assert(nb_subclusters > 0 && nb_subclusters < s->subclusters_per_cluster);
+    assert(sc + nb_subclusters <= s->subclusters_per_cluster);
+
+    ret = get_cluster_table(bs, offset, &l2_slice, &l2_index);
+    if (ret < 0) {
+        return ret;
+    }
+
+    switch (qcow2_get_cluster_type(bs, get_l2_entry(s, l2_slice, l2_index))) {
+    case QCOW2_CLUSTER_COMPRESSED:
+        ret = -ENOTSUP; /* We cannot partially zeroize compressed clusters */
+        goto out;
+    case QCOW2_CLUSTER_NORMAL:
+    case QCOW2_CLUSTER_UNALLOCATED:
+        break;
+    default:
+        g_assert_not_reached();
+    }
+
+    old_l2_bitmap = l2_bitmap = get_l2_bitmap(s, l2_slice, l2_index);
+
+    l2_bitmap |=  QCOW_OFLAG_SUB_ZERO_RANGE(sc, sc + nb_subclusters);
+    l2_bitmap &= ~QCOW_OFLAG_SUB_ALLOC_RANGE(sc, sc + nb_subclusters);
+
+    if (old_l2_bitmap != l2_bitmap) {
+        set_l2_bitmap(s, l2_slice, l2_index, l2_bitmap);
+        qcow2_cache_entry_mark_dirty(s->l2_table_cache, l2_slice);
+    }
+
+    ret = 0;
+out:
+    qcow2_cache_put(s->l2_table_cache, (void **) &l2_slice);
+
+    return ret;
+}
+
+int qcow2_subcluster_zeroize(BlockDriverState *bs, uint64_t offset,
+                             uint64_t bytes, int flags)
 {
     BDRVQcow2State *s = bs->opaque;
     uint64_t end_offset = offset + bytes;
     uint64_t nb_clusters;
+    unsigned head, tail;
     int64_t cleared;
     int ret;
 
@@ -2035,8 +2081,8 @@  int qcow2_cluster_zeroize(BlockDriverState *bs, uint64_t offset,
     }
 
     /* Caller must pass aligned values, except at image end */
-    assert(QEMU_IS_ALIGNED(offset, s->cluster_size));
-    assert(QEMU_IS_ALIGNED(end_offset, s->cluster_size) ||
+    assert(offset_into_subcluster(s, offset) == 0);
+    assert(offset_into_subcluster(s, end_offset) == 0 ||
            end_offset >= bs->total_sectors << BDRV_SECTOR_BITS);
 
     /* The zero flag is only supported by version 3 and newer */
@@ -2044,11 +2090,26 @@  int qcow2_cluster_zeroize(BlockDriverState *bs, uint64_t offset,
         return -ENOTSUP;
     }
 
-    /* Each L2 slice is handled by its own loop iteration */
-    nb_clusters = size_to_clusters(s, bytes);
+    head = MIN(end_offset, ROUND_UP(offset, s->cluster_size)) - offset;
+    offset += head;
+
+    tail = (end_offset >= bs->total_sectors << BDRV_SECTOR_BITS) ? 0 :
+        end_offset - MAX(offset, start_of_cluster(s, end_offset));
+    end_offset -= tail;
 
     s->cache_discards = true;
 
+    if (head) {
+        ret = zero_l2_subclusters(bs, offset - head,
+                                  size_to_subclusters(s, head));
+        if (ret < 0) {
+            goto fail;
+        }
+    }
+
+    /* Each L2 slice is handled by its own loop iteration */
+    nb_clusters = size_to_clusters(s, end_offset - offset);
+
     while (nb_clusters > 0) {
         cleared = zero_in_l2_slice(bs, offset, nb_clusters, flags);
         if (cleared < 0) {
@@ -2060,6 +2121,13 @@  int qcow2_cluster_zeroize(BlockDriverState *bs, uint64_t offset,
         offset += (cleared * s->cluster_size);
     }
 
+    if (tail) {
+        ret = zero_l2_subclusters(bs, end_offset, size_to_subclusters(s, tail));
+        if (ret < 0) {
+            goto fail;
+        }
+    }
+
     ret = 0;
 fail:
     s->cache_discards = false;
diff --git a/block/qcow2.c b/block/qcow2.c
index 430b4e423a..40988fff55 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -1903,7 +1903,7 @@  static void qcow2_refresh_limits(BlockDriverState *bs, Error **errp)
         /* Encryption works on a sector granularity */
         bs->bl.request_alignment = qcrypto_block_get_sector_size(s->crypto);
     }
-    bs->bl.pwrite_zeroes_alignment = s->cluster_size;
+    bs->bl.pwrite_zeroes_alignment = s->subcluster_size;
     bs->bl.pdiscard_alignment = s->cluster_size;
 }
 
@@ -3840,8 +3840,9 @@  static coroutine_fn int qcow2_co_pwrite_zeroes(BlockDriverState *bs,
     int ret;
     BDRVQcow2State *s = bs->opaque;
 
-    uint32_t head = offset % s->cluster_size;
-    uint32_t tail = (offset + bytes) % s->cluster_size;
+    uint32_t head = offset_into_subcluster(s, offset);
+    uint32_t tail = ROUND_UP(offset + bytes, s->subcluster_size) -
+        (offset + bytes);
 
     trace_qcow2_pwrite_zeroes_start_req(qemu_coroutine_self(), offset, bytes);
     if (offset + bytes == bs->total_sectors * BDRV_SECTOR_SIZE) {
@@ -3853,20 +3854,19 @@  static coroutine_fn int qcow2_co_pwrite_zeroes(BlockDriverState *bs,
         unsigned int nr;
         QCow2SubclusterType type;
 
-        assert(head + bytes <= s->cluster_size);
+        assert(head + bytes + tail <= s->subcluster_size);
 
         /* check whether remainder of cluster already reads as zero */
         if (!(is_zero(bs, offset - head, head) &&
-              is_zero(bs, offset + bytes,
-                      tail ? s->cluster_size - tail : 0))) {
+              is_zero(bs, offset + bytes, tail))) {
             return -ENOTSUP;
         }
 
         qemu_co_mutex_lock(&s->lock);
         /* We can have new write after previous check */
-        offset = QEMU_ALIGN_DOWN(offset, s->cluster_size);
-        bytes = s->cluster_size;
-        nr = s->cluster_size;
+        offset -= head;
+        bytes = s->subcluster_size;
+        nr = s->subcluster_size;
         ret = qcow2_get_host_offset(bs, offset, &nr, &off, &type);
         if (ret < 0 ||
             (type != QCOW2_SUBCLUSTER_UNALLOCATED_PLAIN &&
@@ -3882,8 +3882,8 @@  static coroutine_fn int qcow2_co_pwrite_zeroes(BlockDriverState *bs,
 
     trace_qcow2_pwrite_zeroes(qemu_coroutine_self(), offset, bytes);
 
-    /* Whatever is left can use real zero clusters */
-    ret = qcow2_cluster_zeroize(bs, offset, bytes, flags);
+    /* Whatever is left can use real zero subclusters */
+    ret = qcow2_subcluster_zeroize(bs, offset, bytes, flags);
     qemu_co_mutex_unlock(&s->lock);
 
     return ret;
@@ -4356,12 +4356,13 @@  static int coroutine_fn qcow2_co_truncate(BlockDriverState *bs, int64_t offset,
         uint64_t zero_start = QEMU_ALIGN_UP(old_length, s->cluster_size);
 
         /*
-         * Use zero clusters as much as we can. qcow2_cluster_zeroize()
+         * Use zero clusters as much as we can. qcow2_subcluster_zeroize()
          * requires a cluster-aligned start. The end may be unaligned if it is
          * at the end of the image (which it is here).
          */
         if (offset > zero_start) {
-            ret = qcow2_cluster_zeroize(bs, zero_start, offset - zero_start, 0);
+            ret = qcow2_subcluster_zeroize(bs, zero_start, offset - zero_start,
+                                           0);
             if (ret < 0) {
                 error_setg_errno(errp, -ret, "Failed to zero out new clusters");
                 goto fail;