[02/10] qcow2: add qcow2_co_write_compressed

Message ID 1463229957-14253-3-git-send-email-den@openvz.org
State New

Commit Message

Denis V. Lunev May 14, 2016, 12:45 p.m. UTC
From: Pavel Butsykin <pbutsykin@virtuozzo.com>

Added implementation of the qcow2_co_write_compressed function that
will allow us to safely use compressed writes for the qcow2 from running
VMs.

Signed-off-by: Pavel Butsykin <pbutsykin@virtuozzo.com>
Signed-off-by: Denis V. Lunev <den@openvz.org>
CC: Jeff Cody <jcody@redhat.com>
CC: Markus Armbruster <armbru@redhat.com>
CC: Eric Blake <eblake@redhat.com>
CC: John Snow <jsnow@redhat.com>
CC: Stefan Hajnoczi <stefanha@redhat.com>
CC: Kevin Wolf <kwolf@redhat.com>
---
 block/qcow2.c | 89 ++++++++++++++++++++++++++++++++++-------------------------
 1 file changed, 52 insertions(+), 37 deletions(-)

Comments

Stefan Hajnoczi May 27, 2016, 5:33 p.m. UTC | #1
On Sat, May 14, 2016 at 03:45:50PM +0300, Denis V. Lunev wrote:
> +    qemu_co_mutex_lock(&s->lock);
> +    cluster_offset = \
> +        qcow2_alloc_compressed_cluster_offset(bs, sector_num << 9, out_len);

The backslash isn't necessary for wrapping lines in C.  This kind of
thing is only necessary in languages like Python where the grammar is
whitespace sensitive.

The C compiler is happy with an arbitrary amount of whitespace
(newlines) in the middle of a statement.  The backslash in C is handled
by the preprocessor: it joins the line.  That's useful for macro
definitions where you need to tell the preprocessor that several lines
belong to one macro definition.  But it's not needed for normal C code.
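[Editor's note: the difference can be seen in a small standalone sketch; the names below are made up for illustration and do not appear in the patch.]

```c
#include <assert.h>

/* A macro definition is consumed by the preprocessor one logical line
 * at a time, so a trailing backslash is needed to continue it onto the
 * next line: */
#define SQUARE_PLUS_ONE(x) \
    ((x) * (x) + 1)

/* Ordinary C statements may wrap across newlines freely; the compiler
 * treats newlines as whitespace and no backslash is required: */
static int long_sum(int a, int b, int c)
{
    int total = a +
                b +
                c;
    return total;
}
```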

> +    if (!cluster_offset) {
> +        qemu_co_mutex_unlock(&s->lock);
> +        ret = -EIO;
> +        goto fail;
> +    }
> +    cluster_offset &= s->cluster_offset_mask;
>  
> -        BLKDBG_EVENT(bs->file, BLKDBG_WRITE_COMPRESSED);
> -        ret = bdrv_pwrite(bs->file->bs, cluster_offset, out_buf, out_len);
> -        if (ret < 0) {
> -            goto fail;
> -        }
> +    ret = qcow2_pre_write_overlap_check(bs, 0, cluster_offset, out_len);
> +    qemu_co_mutex_unlock(&s->lock);
> +    if (ret < 0) {
> +        goto fail;
>      }
>  
> +    iov = (struct iovec) {
> +        .iov_base   = out_buf,
> +        .iov_len    = out_len,
> +    };
> +    qemu_iovec_init_external(&hd_qiov, &iov, 1);
> +
> +    BLKDBG_EVENT(bs->file, BLKDBG_WRITE_COMPRESSED);
> +    ret = bdrv_co_pwritev(bs->file->bs, cluster_offset, out_len, &hd_qiov, 0);

There is a race condition here:

If the newly allocated cluster is only partially filled by compressed
data then qcow2_alloc_compressed_cluster_offset() remembers that more
bytes are still available in the cluster.  The
qcow2_alloc_compressed_cluster_offset() caller will continue filling the
same cluster.

Imagine two compressed writes running at the same time.  Write A
allocates just a few bytes so write B shares a sector with the first
write:

     Sector 1
  |AAABBBBBBBBB|

The race condition is that bdrv_co_pwritev() uses read-modify-write (a
bounce buffer).  If both requests call bdrv_co_pwritev() around the same
time then the following could happen:

     Sector 1
  |000BBBBBBBBB|

or:

     Sector 1
  |AAA000000000|

It's necessary to hold s->lock around the compressed data write to avoid
this race condition.
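[Editor's note: the lost-update scenario described above can be modeled deterministically outside QEMU. This is only a toy: a 12-byte "sector", made-up names, and sequential calls standing in for the coroutine interleaving.]

```c
#include <assert.h>
#include <string.h>

#define SECTOR_SIZE 12

/* Simulated backing sector, initially all zeroes. */
static unsigned char disk[SECTOR_SIZE];

/* Phase 1 of an unaligned write done via read-modify-write: read the
 * whole sector into a bounce buffer and patch in the new bytes. */
static void rmw_prepare(unsigned char *bounce, int off,
                        const void *data, int len)
{
    memcpy(bounce, disk, SECTOR_SIZE);   /* read   */
    memcpy(bounce + off, data, len);     /* modify */
}

/* Phase 2: write the whole bounce buffer back over the sector. */
static void rmw_commit(const unsigned char *bounce)
{
    memcpy(disk, bounce, SECTOR_SIZE);   /* write */
}
```

If write A (bytes 0-2) and write B (bytes 3-11) both run rmw_prepare() before either commits, each bounce buffer was read from the all-zero sector, so the last commit wins and the other request's bytes revert to zero: |000BBBBBBBBB|. Holding s->lock across the whole compressed write would keep prepare and commit atomic per request.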
Pavel Butsykin May 30, 2016, 9:12 a.m. UTC | #2
On 27.05.2016 20:33, Stefan Hajnoczi wrote:
> On Sat, May 14, 2016 at 03:45:50PM +0300, Denis V. Lunev wrote:
>> +    qemu_co_mutex_lock(&s->lock);
>> +    cluster_offset = \
>> +        qcow2_alloc_compressed_cluster_offset(bs, sector_num << 9, out_len);
>
> The backslash isn't necessary for wrapping lines in C.  This kind of
> thing is only necessary in languages like Python where the grammar is
> whitespace sensitive.
>
> The C compiler is happy with an arbitrary amount of whitespace
> (newlines) in the middle of a statement.  The backslash in C is handled
> by the preprocessor: it joins the line.  That's useful for macro
> definitions where you need to tell the preprocessor that several lines
> belong to one macro definition.  But it's not needed for normal C code.
>
Thanks for the explanation, but the backslash serves more as a visual
marker of the line break for the reader. The current coding style
doesn't cover this point, but I can remove the backslash, since I don't
think it's anything important :)

>> +    if (!cluster_offset) {
>> +        qemu_co_mutex_unlock(&s->lock);
>> +        ret = -EIO;
>> +        goto fail;
>> +    }
>> +    cluster_offset &= s->cluster_offset_mask;
>>
>> -        BLKDBG_EVENT(bs->file, BLKDBG_WRITE_COMPRESSED);
>> -        ret = bdrv_pwrite(bs->file->bs, cluster_offset, out_buf, out_len);
>> -        if (ret < 0) {
>> -            goto fail;
>> -        }
>> +    ret = qcow2_pre_write_overlap_check(bs, 0, cluster_offset, out_len);
>> +    qemu_co_mutex_unlock(&s->lock);
>> +    if (ret < 0) {
>> +        goto fail;
>>       }
>>
>> +    iov = (struct iovec) {
>> +        .iov_base   = out_buf,
>> +        .iov_len    = out_len,
>> +    };
>> +    qemu_iovec_init_external(&hd_qiov, &iov, 1);
>> +
>> +    BLKDBG_EVENT(bs->file, BLKDBG_WRITE_COMPRESSED);
>> +    ret = bdrv_co_pwritev(bs->file->bs, cluster_offset, out_len, &hd_qiov, 0);
>
> There is a race condition here:
>
> If the newly allocated cluster is only partially filled by compressed
> data then qcow2_alloc_compressed_cluster_offset() remembers that more
> bytes are still available in the cluster.  The
> qcow2_alloc_compressed_cluster_offset() caller will continue filling the
> same cluster.
>
> Imagine two compressed writes running at the same time.  Write A
> allocates just a few bytes so write B shares a sector with the first
> write:
>
>       Sector 1
>    |AAABBBBBBBBB|
>
> The race condition is that bdrv_co_pwritev() uses read-modify-write (a
> bounce buffer).  If both requests call bdrv_co_pwritev() around the same
> time then the following could happen:
>
>       Sector 1
>    |000BBBBBBBBB|
>
> or:
>
>       Sector 1
>    |AAA000000000|
>
> It's necessary to hold s->lock around the compressed data write to avoid
> this race condition.
>
I agree, there really is a race. Thank you, this is a very good point!
Pavel Butsykin May 30, 2016, 12:58 p.m. UTC | #3
On 30.05.2016 12:12, Pavel Butsykin wrote:
> On 27.05.2016 20:33, Stefan Hajnoczi wrote:
>> On Sat, May 14, 2016 at 03:45:50PM +0300, Denis V. Lunev wrote:
>>> +    qemu_co_mutex_lock(&s->lock);
>>> +    cluster_offset = \
>>> +        qcow2_alloc_compressed_cluster_offset(bs, sector_num << 9,
>>> out_len);
>>
>> The backslash isn't necessary for wrapping lines in C.  This kind of
>> thing is only necessary in languages like Python where the grammar is
>> whitespace sensitive.
>>
>> The C compiler is happy with an arbitrary amount of whitespace
>> (newlines) in the middle of a statement.  The backslash in C is handled
>> by the preprocessor: it joins the line.  That's useful for macro
>> definitions where you need to tell the preprocessor that several lines
>> belong to one macro definition.  But it's not needed for normal C code.
>>
> Thanks for the explanation, but the backslash serves more as a visual
> marker of the line break for the reader. The current coding style
> doesn't cover this point, but I can remove the backslash, since I don't
> think it's anything important :)
>
>>> +    if (!cluster_offset) {
>>> +        qemu_co_mutex_unlock(&s->lock);
>>> +        ret = -EIO;
>>> +        goto fail;
>>> +    }
>>> +    cluster_offset &= s->cluster_offset_mask;
>>>
>>> -        BLKDBG_EVENT(bs->file, BLKDBG_WRITE_COMPRESSED);
>>> -        ret = bdrv_pwrite(bs->file->bs, cluster_offset, out_buf,
>>> out_len);
>>> -        if (ret < 0) {
>>> -            goto fail;
>>> -        }
>>> +    ret = qcow2_pre_write_overlap_check(bs, 0, cluster_offset,
>>> out_len);
>>> +    qemu_co_mutex_unlock(&s->lock);
>>> +    if (ret < 0) {
>>> +        goto fail;
>>>       }
>>>
>>> +    iov = (struct iovec) {
>>> +        .iov_base   = out_buf,
>>> +        .iov_len    = out_len,
>>> +    };
>>> +    qemu_iovec_init_external(&hd_qiov, &iov, 1);
>>> +
>>> +    BLKDBG_EVENT(bs->file, BLKDBG_WRITE_COMPRESSED);
>>> +    ret = bdrv_co_pwritev(bs->file->bs, cluster_offset, out_len,
>>> &hd_qiov, 0);
>>
>> There is a race condition here:
>>
>> If the newly allocated cluster is only partially filled by compressed
>> data then qcow2_alloc_compressed_cluster_offset() remembers that more
>> bytes are still available in the cluster.  The
>> qcow2_alloc_compressed_cluster_offset() caller will continue filling the
>> same cluster.
>>
>> Imagine two compressed writes running at the same time.  Write A
>> allocates just a few bytes so write B shares a sector with the first
>> write:

Sorry, but it seems this will never happen, because the second write
will not pass this check:

uint64_t qcow2_alloc_compressed_cluster_offset(BlockDriverState *bs,
                                                uint64_t offset,
                                                int compressed_size)
{
     ...
     /* Compression can't overwrite anything. Fail if the cluster was 
already
      * allocated. */
     cluster_offset = be64_to_cpu(l2_table[l2_index]);
     if (cluster_offset & L2E_OFFSET_MASK) {
         qcow2_cache_put(bs, s->l2_table_cache, (void**) &l2_table);
         return 0;
     }
    ...

As you can see, we can't do a compressed write into an already allocated
cluster.

>>
>>       Sector 1
>>    |AAABBBBBBBBB|
>>
>> The race condition is that bdrv_co_pwritev() uses read-modify-write (a
>> bounce buffer).  If both requests call bdrv_co_pwritev() around the same
>> time then the following could happen:
>>
>>       Sector 1
>>    |000BBBBBBBBB|
>>
>> or:
>>
>>       Sector 1
>>    |AAA000000000|
>>
>> It's necessary to hold s->lock around the compressed data write to avoid
>> this race condition.
>>
> I agree, there really is a race. Thank you, this is a very good point!
>
>
Eric Blake May 31, 2016, 6:42 p.m. UTC | #4
On 05/30/2016 06:58 AM, Pavel Butsykin wrote:

> 
> Sorry, but it seems this will never happen, because the second write
> will not pass this check:
> 
> uint64_t qcow2_alloc_compressed_cluster_offset(BlockDriverState *bs,
>                                                uint64_t offset,
>                                                int compressed_size)
> {
>     ...
>     /* Compression can't overwrite anything. Fail if the cluster was
> already
>      * allocated. */
>     cluster_offset = be64_to_cpu(l2_table[l2_index]);
>     if (cluster_offset & L2E_OFFSET_MASK) {
>         qcow2_cache_put(bs, s->l2_table_cache, (void**) &l2_table);
>         return 0;
>     }
>    ...
> 
> As you can see we can't do the compressed write in the already allocated
> cluster.

Umm, doesn't that defeat the point of compression, if every compressed
cluster becomes the head of a new cluster?  The whole goal of
compression is to be able to fit multiple clusters within one.
Denis V. Lunev May 31, 2016, 9 p.m. UTC | #5
On 05/31/2016 09:42 PM, Eric Blake wrote:
> On 05/30/2016 06:58 AM, Pavel Butsykin wrote:
>
>> Sorry, but it seems this will never happen, because the second write
>> will not pass this check:
>>
>> uint64_t qcow2_alloc_compressed_cluster_offset(BlockDriverState *bs,
>>                                                 uint64_t offset,
>>                                                 int compressed_size)
>> {
>>      ...
>>      /* Compression can't overwrite anything. Fail if the cluster was
>> already
>>       * allocated. */
>>      cluster_offset = be64_to_cpu(l2_table[l2_index]);
>>      if (cluster_offset & L2E_OFFSET_MASK) {
>>          qcow2_cache_put(bs, s->l2_table_cache, (void**) &l2_table);
>>          return 0;
>>      }
>>     ...
>>
>> As you can see we can't do the compressed write in the already allocated
>> cluster.
> Umm, doesn't that defeat the point of compression, if every compressed
> cluster becomes the head of a new cluster?  The whole goal of
> compression is to be able to fit multiple clusters within one.
>
AFAIK the file will be sparse in those unused areas.
Eric Blake May 31, 2016, 9:13 p.m. UTC | #6
On 05/31/2016 03:00 PM, Denis V. Lunev wrote:
> On 05/31/2016 09:42 PM, Eric Blake wrote:
>> On 05/30/2016 06:58 AM, Pavel Butsykin wrote:
>>
>>> Sorry, but it seems this will never happen, because the second write
>>> will not pass this check:
>>>
>>> uint64_t qcow2_alloc_compressed_cluster_offset(BlockDriverState *bs,
>>>                                                 uint64_t offset,
>>>                                                 int compressed_size)
>>> {
>>>      ...
>>>      /* Compression can't overwrite anything. Fail if the cluster was
>>> already
>>>       * allocated. */
>>>      cluster_offset = be64_to_cpu(l2_table[l2_index]);
>>>      if (cluster_offset & L2E_OFFSET_MASK) {
>>>          qcow2_cache_put(bs, s->l2_table_cache, (void**) &l2_table);
>>>          return 0;
>>>      }
>>>     ...
>>>
>>> As you can see we can't do the compressed write in the already allocated
>>> cluster.
>> Umm, doesn't that defeat the point of compression, if every compressed
>> cluster becomes the head of a new cluster?  The whole goal of
>> compression is to be able to fit multiple clusters within one.
>>
> AFAIK the file will be sparse in those unused areas.

IIRC, on the NTFS file system, the minimum hole size is 64k. If you also
have 64k clusters, then you don't have a sparse file - every tail of
zero sectors will be explicit in the filesystem, if you are using 1:1
clusters for compression.  Other file systems may have finer granularity
for holes, but it's still rather awkward to be relying on sparseness
when a better solution is to pack compressed sectors consecutively.
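[Editor's note: Eric's point can be put into numbers. A rough sketch with assumed figures: 64 KiB qcow2 clusters, a 64 KiB minimum hole size as on NTFS, and a cluster that happens to compress to 4 KiB.]

```c
#include <assert.h>

#define CLUSTER_SIZE (64L * 1024)
#define HOLE_GRAN    (64L * 1024)   /* assumed NTFS minimum hole size */

/* Bytes the filesystem ends up allocating for one compressed cluster
 * when every compressed cluster starts a fresh host cluster and the
 * zero tail can only be punched out at HOLE_GRAN granularity. */
static long allocated_per_cluster(long compressed_len)
{
    long tail = CLUSTER_SIZE - compressed_len;
    long hole = (tail / HOLE_GRAN) * HOLE_GRAN;  /* punchable part of the tail */
    return CLUSTER_SIZE - hole;
}
```

With 64 KiB holes, the 60 KiB zero tail cannot be punched at all, so 4 KiB of compressed data still costs a full 64 KiB on disk; packing compressed clusters consecutively would cost roughly the compressed size instead.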
Kevin Wolf June 1, 2016, 9:25 a.m. UTC | #7
Am 27.05.2016 um 19:33 hat Stefan Hajnoczi geschrieben:
> On Sat, May 14, 2016 at 03:45:50PM +0300, Denis V. Lunev wrote:
> > +    qemu_co_mutex_lock(&s->lock);
> > +    cluster_offset = \
> > +        qcow2_alloc_compressed_cluster_offset(bs, sector_num << 9, out_len);
> 
> The backslash isn't necessary for wrapping lines in C.  This kind of
> thing is only necessary in languages like Python where the grammar is
> whitespace sensitive.
> 
> The C compiler is happy with an arbitrary amount of whitespace
> (newlines) in the middle of a statement.  The backslash in C is handled
> by the preprocessor: it joins the line.  That's useful for macro
> definitions where you need to tell the preprocessor that several lines
> belong to one macro definition.  But it's not needed for normal C code.
> 
> > +    if (!cluster_offset) {
> > +        qemu_co_mutex_unlock(&s->lock);
> > +        ret = -EIO;
> > +        goto fail;
> > +    }
> > +    cluster_offset &= s->cluster_offset_mask;
> >  
> > -        BLKDBG_EVENT(bs->file, BLKDBG_WRITE_COMPRESSED);
> > -        ret = bdrv_pwrite(bs->file->bs, cluster_offset, out_buf, out_len);
> > -        if (ret < 0) {
> > -            goto fail;
> > -        }
> > +    ret = qcow2_pre_write_overlap_check(bs, 0, cluster_offset, out_len);
> > +    qemu_co_mutex_unlock(&s->lock);
> > +    if (ret < 0) {
> > +        goto fail;
> >      }
> >  
> > +    iov = (struct iovec) {
> > +        .iov_base   = out_buf,
> > +        .iov_len    = out_len,
> > +    };
> > +    qemu_iovec_init_external(&hd_qiov, &iov, 1);
> > +
> > +    BLKDBG_EVENT(bs->file, BLKDBG_WRITE_COMPRESSED);
> > +    ret = bdrv_co_pwritev(bs->file->bs, cluster_offset, out_len, &hd_qiov, 0);
> 
> There is a race condition here:
> 
> If the newly allocated cluster is only partially filled by compressed
> data then qcow2_alloc_compressed_cluster_offset() remembers that more
> bytes are still available in the cluster.  The
> qcow2_alloc_compressed_cluster_offset() caller will continue filling the
> same cluster.
> 
> Imagine two compressed writes running at the same time.  Write A
> allocates just a few bytes so write B shares a sector with the first
> write:
> 
>      Sector 1
>   |AAABBBBBBBBB|
> 
> The race condition is that bdrv_co_pwritev() uses read-modify-write (a
> bounce buffer).  If both requests call bdrv_co_pwritev() around the same
> time then the following could happen:
> 
>      Sector 1
>   |000BBBBBBBBB|
> 
> or:
> 
>      Sector 1
>   |AAA000000000|
> 
> It's necessary to hold s->lock around the compressed data write to avoid
> this race condition.

I don't think this race condition exists. bdrv_co_pwritev() can indeed
perform RMW if the lower layer requires alignment, but that's not
something callers need to care about (which would be an awful API) but
is fully handled there by marking requests serialising if they involve
RMW.

Kevin
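[Editor's note: the serialising idea Kevin refers to can be sketched abstractly. This is only the shape of the mechanism; QEMU's actual implementation is the serialising-request machinery in block/io.c, and the names below are invented.]

```c
#include <assert.h>

#define ALIGN_DOWN(x, a) ((x) / (a) * (a))
#define ALIGN_UP(x, a)   (((x) + (a) - 1) / (a) * (a))

/* The byte range a request effectively touches. */
typedef struct {
    long start;
    long end;    /* exclusive */
} Request;

/* Expand a request to the alignment the lower layer needs.  This is
 * the range a read-modify-write actually reads and rewrites. */
static Request serialising_range(long offset, long bytes, long align)
{
    Request req = {
        .start = ALIGN_DOWN(offset, align),
        .end   = ALIGN_UP(offset + bytes, align),
    };
    return req;
}

/* A new request must wait (be serialised) if its expanded range
 * overlaps an in-flight request's expanded range. */
static int must_wait(const Request *a, const Request *b)
{
    return a->start < b->end && b->start < a->end;
}
```

Writes A (offset 0, 3 bytes) and B (offset 3, 9 bytes) don't overlap as guest ranges, but with 512-byte alignment both expand to [0, 512), so the second is serialised behind the first and the lost update cannot happen.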
Kevin Wolf June 1, 2016, 9:31 a.m. UTC | #8
Am 31.05.2016 um 20:42 hat Eric Blake geschrieben:
> On 05/30/2016 06:58 AM, Pavel Butsykin wrote:
> 
> > 
> > Sorry, but it seems this will never happen, because the second write
> > will not pass this check:
> > 
> > uint64_t qcow2_alloc_compressed_cluster_offset(BlockDriverState *bs,
> >                                                uint64_t offset,
> >                                                int compressed_size)
> > {
> >     ...
> >     /* Compression can't overwrite anything. Fail if the cluster was
> > already
> >      * allocated. */
> >     cluster_offset = be64_to_cpu(l2_table[l2_index]);
> >     if (cluster_offset & L2E_OFFSET_MASK) {
> >         qcow2_cache_put(bs, s->l2_table_cache, (void**) &l2_table);
> >         return 0;
> >     }
> >    ...
> > 
> > As you can see we can't do the compressed write in the already allocated
> > cluster.
> 
> Umm, doesn't that defeat the point of compression, if every compressed
> cluster becomes the head of a new cluster?  The whole goal of
> compression is to be able to fit multiple clusters within one.

With compression, be careful what kind of clusters you're looking at.
The clusters in the code above are guest clusters and you can't
overwrite a guest cluster that has already been written (I'm not sure
why, though - didn't matter so far because the only writer is qemu-img
convert, which writes the whole image sequentially and doesn't split
clusters across requests, but if we want to allow compressed writes in a
running VM, we may want to change this).

However, the compressed data of multiple guest clusters may end up in
the same host cluster, and that's indeed the whole point of compression.

Kevin
Pavel Butsykin June 1, 2016, 9:53 a.m. UTC | #9
On 01.06.2016 00:13, Eric Blake wrote:
> On 05/31/2016 03:00 PM, Denis V. Lunev wrote:
>> On 05/31/2016 09:42 PM, Eric Blake wrote:
>>> On 05/30/2016 06:58 AM, Pavel Butsykin wrote:
>>>
>>>> Sorry, but it seems this will never happen, because the second write
>>>> will not pass this check:
>>>>
>>>> uint64_t qcow2_alloc_compressed_cluster_offset(BlockDriverState *bs,
>>>>                                                  uint64_t offset,
>>>>                                                  int compressed_size)
>>>> {
>>>>       ...
>>>>       /* Compression can't overwrite anything. Fail if the cluster was
>>>> already
>>>>        * allocated. */
>>>>       cluster_offset = be64_to_cpu(l2_table[l2_index]);
>>>>       if (cluster_offset & L2E_OFFSET_MASK) {
>>>>           qcow2_cache_put(bs, s->l2_table_cache, (void**) &l2_table);
>>>>           return 0;
>>>>       }
>>>>      ...
>>>>
>>>> As you can see we can't do the compressed write in the already allocated
>>>> cluster.
>>> Umm, doesn't that defeat the point of compression, if every compressed
>>> cluster becomes the head of a new cluster?  The whole goal of
>>> compression is to be able to fit multiple clusters within one.
>>>
>> AFAIK the file will be sparse in those unused areas.
>
> IIRC, on the NTFS file system, the minimum hole size is 64k. If you also
> have 64k clusters, then you don't have a sparse file - every tail of
> zero sectors will be explicit in the filesystem, if you are using 1:1
> clusters for compression.  Other file systems may have finer granularity
> for holes, but it's still rather awkward to be relying on sparseness
> when a better solution is to pack compressed sectors consecutively.
>

Maybe you're right; I agree that the issues mentioned above are real and
it would be good to solve them, but that looks like the next step. The
solution you propose amounts to an improvement of the format driver
that goes beyond the goals of this patch set, whose main goal is to
expose compression from the format drivers for the backup function.
Stefan Hajnoczi June 1, 2016, 8:06 p.m. UTC | #10
On Wed, Jun 01, 2016 at 11:25:57AM +0200, Kevin Wolf wrote:
> Am 27.05.2016 um 19:33 hat Stefan Hajnoczi geschrieben:
> > On Sat, May 14, 2016 at 03:45:50PM +0300, Denis V. Lunev wrote:
> > > +    qemu_co_mutex_lock(&s->lock);
> > > +    cluster_offset = \
> > > +        qcow2_alloc_compressed_cluster_offset(bs, sector_num << 9, out_len);
> > 
> > The backslash isn't necessary for wrapping lines in C.  This kind of
> > thing is only necessary in languages like Python where the grammar is
> > whitespace sensitive.
> > 
> > The C compiler is happy with an arbitrary amount of whitespace
> > (newlines) in the middle of a statement.  The backslash in C is handled
> > by the preprocessor: it joins the line.  That's useful for macro
> > definitions where you need to tell the preprocessor that several lines
> > belong to one macro definition.  But it's not needed for normal C code.
> > 
> > > +    if (!cluster_offset) {
> > > +        qemu_co_mutex_unlock(&s->lock);
> > > +        ret = -EIO;
> > > +        goto fail;
> > > +    }
> > > +    cluster_offset &= s->cluster_offset_mask;
> > >  
> > > -        BLKDBG_EVENT(bs->file, BLKDBG_WRITE_COMPRESSED);
> > > -        ret = bdrv_pwrite(bs->file->bs, cluster_offset, out_buf, out_len);
> > > -        if (ret < 0) {
> > > -            goto fail;
> > > -        }
> > > +    ret = qcow2_pre_write_overlap_check(bs, 0, cluster_offset, out_len);
> > > +    qemu_co_mutex_unlock(&s->lock);
> > > +    if (ret < 0) {
> > > +        goto fail;
> > >      }
> > >  
> > > +    iov = (struct iovec) {
> > > +        .iov_base   = out_buf,
> > > +        .iov_len    = out_len,
> > > +    };
> > > +    qemu_iovec_init_external(&hd_qiov, &iov, 1);
> > > +
> > > +    BLKDBG_EVENT(bs->file, BLKDBG_WRITE_COMPRESSED);
> > > +    ret = bdrv_co_pwritev(bs->file->bs, cluster_offset, out_len, &hd_qiov, 0);
> > 
> > There is a race condition here:
> > 
> > If the newly allocated cluster is only partially filled by compressed
> > data then qcow2_alloc_compressed_cluster_offset() remembers that more
> > bytes are still available in the cluster.  The
> > qcow2_alloc_compressed_cluster_offset() caller will continue filling the
> > same cluster.
> > 
> > Imagine two compressed writes running at the same time.  Write A
> > allocates just a few bytes so write B shares a sector with the first
> > write:
> > 
> >      Sector 1
> >   |AAABBBBBBBBB|
> > 
> > The race condition is that bdrv_co_pwritev() uses read-modify-write (a
> > bounce buffer).  If both requests call bdrv_co_pwritev() around the same
> > time then the following could happen:
> > 
> >      Sector 1
> >   |000BBBBBBBBB|
> > 
> > or:
> > 
> >      Sector 1
> >   |AAA000000000|
> > 
> > It's necessary to hold s->lock around the compressed data write to avoid
> > this race condition.
> 
> I don't think this race condition exists. bdrv_co_pwritev() can indeed
> perform RMW if the lower layer requires alignment, but that's not
> something callers need to care about (which would be an awful API) but
> is fully handled there by marking requests serialising if they involve
> RMW.

Good point.

Stefan

Patch

diff --git a/block/qcow2.c b/block/qcow2.c
index 62febfc..d948d44 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -2533,13 +2533,16 @@  static int qcow2_truncate(BlockDriverState *bs, int64_t offset)
 
 /* XXX: put compressed sectors first, then all the cluster aligned
    tables to avoid losing bytes in alignment */
-static int qcow2_write_compressed(BlockDriverState *bs, int64_t sector_num,
-                                  const uint8_t *buf, int nb_sectors)
+static coroutine_fn int
+qcow2_co_write_compressed(BlockDriverState *bs, int64_t sector_num,
+                          int nb_sectors, QEMUIOVector *qiov)
 {
     BDRVQcow2State *s = bs->opaque;
+    QEMUIOVector hd_qiov;
+    struct iovec iov;
     z_stream strm;
     int ret, out_len;
-    uint8_t *out_buf;
+    uint8_t *buf, *out_buf;
     uint64_t cluster_offset;
 
     if (nb_sectors == 0) {
@@ -2549,29 +2552,25 @@  static int qcow2_write_compressed(BlockDriverState *bs, int64_t sector_num,
         return bdrv_truncate(bs->file->bs, cluster_offset);
     }
 
+    buf = qemu_blockalign(bs, s->cluster_size);
     if (nb_sectors != s->cluster_sectors) {
-        ret = -EINVAL;
-
-        /* Zero-pad last write if image size is not cluster aligned */
-        if (sector_num + nb_sectors == bs->total_sectors &&
-            nb_sectors < s->cluster_sectors) {
-            uint8_t *pad_buf = qemu_blockalign(bs, s->cluster_size);
-            memset(pad_buf, 0, s->cluster_size);
-            memcpy(pad_buf, buf, nb_sectors * BDRV_SECTOR_SIZE);
-            ret = qcow2_write_compressed(bs, sector_num,
-                                         pad_buf, s->cluster_sectors);
-            qemu_vfree(pad_buf);
+        if (nb_sectors > s->cluster_sectors ||
+            sector_num + nb_sectors != bs->total_sectors)
+        {
+            qemu_vfree(buf);
+            return -EINVAL;
         }
-        return ret;
+        /* Zero-pad last write if image size is not cluster aligned */
+        memset(buf, 0, s->cluster_size);
     }
+    qemu_iovec_to_buf(qiov, 0, buf, qiov->size);
 
     out_buf = g_malloc(s->cluster_size + (s->cluster_size / 1000) + 128);
 
     /* best compression, small window, no zlib header */
     memset(&strm, 0, sizeof(strm));
-    ret = deflateInit2(&strm, Z_DEFAULT_COMPRESSION,
-                       Z_DEFLATED, -12,
-                       9, Z_DEFAULT_STRATEGY);
+    ret = deflateInit2(&strm, Z_DEFAULT_COMPRESSION, Z_DEFLATED,
+                       -12, 9, Z_DEFAULT_STRATEGY);
     if (ret != 0) {
         ret = -EINVAL;
         goto fail;
@@ -2593,34 +2592,50 @@  static int qcow2_write_compressed(BlockDriverState *bs, int64_t sector_num,
     deflateEnd(&strm);
 
     if (ret != Z_STREAM_END || out_len >= s->cluster_size) {
+        iov = (struct iovec) {
+            .iov_base   = buf,
+            .iov_len    = out_len,
+        };
+        qemu_iovec_init_external(&hd_qiov, &iov, 1);
         /* could not compress: write normal cluster */
-        ret = bdrv_write(bs, sector_num, buf, s->cluster_sectors);
+        ret = qcow2_co_writev(bs, sector_num, s->cluster_sectors, &hd_qiov);
         if (ret < 0) {
             goto fail;
         }
-    } else {
-        cluster_offset = qcow2_alloc_compressed_cluster_offset(bs,
-            sector_num << 9, out_len);
-        if (!cluster_offset) {
-            ret = -EIO;
-            goto fail;
-        }
-        cluster_offset &= s->cluster_offset_mask;
+        goto success;
+    }
 
-        ret = qcow2_pre_write_overlap_check(bs, 0, cluster_offset, out_len);
-        if (ret < 0) {
-            goto fail;
-        }
+    qemu_co_mutex_lock(&s->lock);
+    cluster_offset = \
+        qcow2_alloc_compressed_cluster_offset(bs, sector_num << 9, out_len);
+    if (!cluster_offset) {
+        qemu_co_mutex_unlock(&s->lock);
+        ret = -EIO;
+        goto fail;
+    }
+    cluster_offset &= s->cluster_offset_mask;
 
-        BLKDBG_EVENT(bs->file, BLKDBG_WRITE_COMPRESSED);
-        ret = bdrv_pwrite(bs->file->bs, cluster_offset, out_buf, out_len);
-        if (ret < 0) {
-            goto fail;
-        }
+    ret = qcow2_pre_write_overlap_check(bs, 0, cluster_offset, out_len);
+    qemu_co_mutex_unlock(&s->lock);
+    if (ret < 0) {
+        goto fail;
     }
 
+    iov = (struct iovec) {
+        .iov_base   = out_buf,
+        .iov_len    = out_len,
+    };
+    qemu_iovec_init_external(&hd_qiov, &iov, 1);
+
+    BLKDBG_EVENT(bs->file, BLKDBG_WRITE_COMPRESSED);
+    ret = bdrv_co_pwritev(bs->file->bs, cluster_offset, out_len, &hd_qiov, 0);
+    if (ret < 0) {
+        goto fail;
+    }
+success:
     ret = 0;
 fail:
+    qemu_vfree(buf);
     g_free(out_buf);
     return ret;
 }
@@ -3382,7 +3397,7 @@  BlockDriver bdrv_qcow2 = {
     .bdrv_co_write_zeroes   = qcow2_co_write_zeroes,
     .bdrv_co_discard        = qcow2_co_discard,
     .bdrv_truncate          = qcow2_truncate,
-    .bdrv_write_compressed  = qcow2_write_compressed,
+    .bdrv_co_write_compressed  = qcow2_co_write_compressed,
     .bdrv_make_empty        = qcow2_make_empty,
 
     .bdrv_snapshot_create   = qcow2_snapshot_create,