
[4/9] mirror: efficiently zero out target

Message ID 1465917916-22348-5-git-send-email-den@openvz.org (mailing list archive)
State New, archived

Commit Message

Denis V. Lunev June 14, 2016, 3:25 p.m. UTC
With a bdrv_co_write_zeroes method on a target BDS, zeroes will not be placed
on the wire. Thus the target can be zeroed out very efficiently. This
should be done with the largest chunk possible.

This improves the performance of live migration of an empty disk by
150 times if NBD supports write_zeroes.

Signed-off-by: Denis V. Lunev <den@openvz.org>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
CC: Stefan Hajnoczi <stefanha@redhat.com>
CC: Fam Zheng <famz@redhat.com>
CC: Kevin Wolf <kwolf@redhat.com>
CC: Max Reitz <mreitz@redhat.com>
CC: Jeff Cody <jcody@redhat.com>
CC: Eric Blake <eblake@redhat.com>
---
 block/mirror.c | 32 +++++++++++++++++++++++++++++++-
 1 file changed, 31 insertions(+), 1 deletion(-)

Comments

Eric Blake June 15, 2016, 3 a.m. UTC | #1
On 06/14/2016 09:25 AM, Denis V. Lunev wrote:
> With a bdrv_co_write_zeroes method on a target BDS, zeroes will not be placed
> on the wire. Thus the target can be zeroed out very efficiently. This
> should be done with the largest chunk possible.
> 
> This improves the performance of live migration of an empty disk by
> 150 times if NBD supports write_zeroes.
> 
> Signed-off-by: Denis V. Lunev <den@openvz.org>
> Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
> CC: Stefan Hajnoczi <stefanha@redhat.com>
> CC: Fam Zheng <famz@redhat.com>
> CC: Kevin Wolf <kwolf@redhat.com>
> CC: Max Reitz <mreitz@redhat.com>
> CC: Jeff Cody <jcody@redhat.com>
> CC: Eric Blake <eblake@redhat.com>
> ---
>  block/mirror.c | 32 +++++++++++++++++++++++++++++++-
>  1 file changed, 31 insertions(+), 1 deletion(-)
> 
> diff --git a/block/mirror.c b/block/mirror.c
> index c7b3639..c2f8773 100644
> --- a/block/mirror.c
> +++ b/block/mirror.c
> @@ -21,6 +21,7 @@
>  #include "qemu/ratelimit.h"
>  #include "qemu/bitmap.h"
>  
> +#define MIRROR_ZERO_CHUNK   (3u << (29 - BDRV_SECTOR_BITS))      /* 1.5 Gb */

Probably nicer to track this in bytes.  And do you really want a
hard-coded arbitrary limit, or is it better to live with
MIN_NON_ZERO(target_bs->bl.max_pwrite_zeroes, INT_MAX)?
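
[For reference, a minimal sketch of that computation; MIN_NON_ZERO is the real
macro from include/qemu/osdep.h, while the conversion of the byte limit to
sectors is an assumption, since the mirror loop was still sector-based at the
time:

    /* Cap each zeroing chunk at what the target advertises; fall back
     * to INT_MAX bytes when the driver reports no limit (0). */
    int max_bytes = MIN_NON_ZERO(target_bs->bl.max_pwrite_zeroes, INT_MAX);
    int64_t max_sectors = max_bytes >> BDRV_SECTOR_BITS;
    int nb_sectors = MIN(max_sectors, end - sector_num);
]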

> @@ -512,7 +513,8 @@ static int mirror_dirty_init(MirrorBlockJob *s)
>  
>      end = s->bdev_length / BDRV_SECTOR_SIZE;
>  
> -    if (base == NULL && !bdrv_has_zero_init(target_bs)) {
> +    if (base == NULL && !bdrv_has_zero_init(target_bs) &&
> +            target_bs->drv->bdrv_co_write_zeroes == NULL) {

Indentation is off, although if checkpatch.pl doesn't complain I guess
it doesn't matter that much.

Why should you care whether the target_bs->drv implements a callback?
Can't you just rely on the normal bdrv_*() functions to do the dirty
work of picking the most efficient implementation without you having to
bypass the block layer?  In fact, isn't that the whole goal of
bdrv_make_zero() - why not call that instead of reimplementing it?
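
[For context, a minimal sketch of that route, using the signature of this era;
passing BDRV_REQ_MAY_UNMAP is an optional choice here, not something the
review prescribes:

    /* Let the block layer zero the whole device, picking the most
     * efficient mechanism the driver offers. */
    int ret = bdrv_make_zero(target_bs, BDRV_REQ_MAY_UNMAP);
    if (ret < 0) {
        return ret;
    }
]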

Patch needs rebasing - we've redone this into bdrv_co_pwrite_zeroes and
a byte interface, since upstream commit c1499a5e.

>          bdrv_set_dirty_bitmap(s->dirty_bitmap, 0, end);
>          return 0;
>      }
> @@ -546,6 +548,34 @@ static int mirror_dirty_init(MirrorBlockJob *s)
>          }
>          sector_num += n;
>      }
> +
> +    if (base != NULL || bdrv_has_zero_init(target_bs)) {

You're now repeating the conditional that used to be 'bool
mark_all_dirty' (well, this is !mark_all_dirty); is it worth keeping the
simpler bool around?

> +        /* no need to zero out entire disk */
> +        return 0;
> +    }
> +
> +    for (sector_num = 0; sector_num < end; ) {
> +        int nb_sectors = MIN(MIRROR_ZERO_CHUNK, end - sector_num);

Why limit yourself to 1.5G? It's either too small for what you can
really do, or too large for what the device permits.  See my above
comment about MIN_NON_ZERO.

> +        int64_t now = qemu_clock_get_ns(QEMU_CLOCK_REALTIME);
> +
> +        if (now - last_pause_ns > SLICE_TIME) {
> +            last_pause_ns = now;
> +            block_job_sleep_ns(&s->common, QEMU_CLOCK_REALTIME, 0);
> +        }
> +
> +        if (block_job_is_cancelled(&s->common)) {
> +            return -EINTR;
> +        }
> +
> +        if (s->in_flight == MAX_IN_FLIGHT) {
> +            trace_mirror_yield(s, s->in_flight, s->buf_free_count, -1);
> +            mirror_wait_for_io(s);
> +            continue;
> +        }

Hmm - I guess your mirror yield points are why you couldn't just
directly use bdrv_make_zero(); but is that something where some code
refactoring can share more code rather than duplicating it?

> +
> +        mirror_do_zero_or_discard(s, sector_num, nb_sectors, false);
> +        sector_num += nb_sectors;
> +    }
>      return 0;
>  }
>  
>
Denis V. Lunev June 15, 2016, 8:46 a.m. UTC | #2
On 06/15/2016 06:00 AM, Eric Blake wrote:
> On 06/14/2016 09:25 AM, Denis V. Lunev wrote:
>> With a bdrv_co_write_zeroes method on a target BDS, zeroes will not be placed
>> on the wire. Thus the target can be zeroed out very efficiently. This
>> should be done with the largest chunk possible.
>>
>> This improves the performance of live migration of an empty disk by
>> 150 times if NBD supports write_zeroes.
>>
>> Signed-off-by: Denis V. Lunev <den@openvz.org>
>> Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
>> CC: Stefan Hajnoczi <stefanha@redhat.com>
>> CC: Fam Zheng <famz@redhat.com>
>> CC: Kevin Wolf <kwolf@redhat.com>
>> CC: Max Reitz <mreitz@redhat.com>
>> CC: Jeff Cody <jcody@redhat.com>
>> CC: Eric Blake <eblake@redhat.com>
>> ---
>>   block/mirror.c | 32 +++++++++++++++++++++++++++++++-
>>   1 file changed, 31 insertions(+), 1 deletion(-)
>>
>> diff --git a/block/mirror.c b/block/mirror.c
>> index c7b3639..c2f8773 100644
>> --- a/block/mirror.c
>> +++ b/block/mirror.c
>> @@ -21,6 +21,7 @@
>>   #include "qemu/ratelimit.h"
>>   #include "qemu/bitmap.h"
>>   
>> +#define MIRROR_ZERO_CHUNK   (3u << (29 - BDRV_SECTOR_BITS))      /* 1.5 Gb */
> Probably nicer to track this in bytes.  And do you really want a
> hard-coded arbitrary limit, or is it better to live with
> MIN_NON_ZERO(target_bs->bl.max_pwrite_zeroes, INT_MAX)?
unfortunately we should. INT_MAX is not aligned as required.
Maybe we should align INT_MAX properly to fulfill the
write_zeroes alignment.

Hmm, maybe we can align INT_MAX down properly. OK,
I'll try to do that gracefully.

>> @@ -512,7 +513,8 @@ static int mirror_dirty_init(MirrorBlockJob *s)
>>   
>>       end = s->bdev_length / BDRV_SECTOR_SIZE;
>>   
>> -    if (base == NULL && !bdrv_has_zero_init(target_bs)) {
>> +    if (base == NULL && !bdrv_has_zero_init(target_bs) &&
>> +            target_bs->drv->bdrv_co_write_zeroes == NULL) {
> Indentation is off, although if checkpatch.pl doesn't complain I guess
> it doesn't matter that much.
>
> Why should you care whether the target_bs->drv implements a callback?
> Can't you just rely on the normal bdrv_*() functions to do the dirty
> work of picking the most efficient implementation without you having to
> bypass the block layer?  In fact, isn't that the whole goal of
> bdrv_make_zero() - why not call that instead of reimplementing it?
this is the idea of the patch, actually. If the callback is not
implemented, we will have zeroes actually written or sent to the wire.
In this case there is not much sense to do that: the amount of data
actually written will be significantly increased (some areas will be
written twice - with zeroes and with the actual data).

On the other hand, if the callback is implemented, we will have a very
small amount of data on the wire and actually written, and thus will have
a benefit. I am trying to avoid very small chunks of data. Here (during
the migration process) the data is sent in 10 MB chunks, which takes a
LOT of time with NBD. We can send 1.5 GB chunks (currently). They occupy
the same 26 bytes on the transport layer.
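
[For scale, the per-request overhead is a small fixed header; a sketch of the
classic NBD request layout as found in QEMU's nbd.h of that era:

    /* Old-style NBD request header: the payload length is a single
     * 32-bit field, so the wire overhead is constant per request
     * regardless of how much data (or zeroes) it describes. */
    struct nbd_request {
        uint32_t magic;   /* NBD_REQUEST_MAGIC */
        uint32_t type;    /* e.g. NBD_CMD_WRITE */
        uint64_t handle;  /* correlates request and reply */
        uint64_t from;    /* offset into the export */
        uint32_t len;     /* length in bytes */
    };
]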

> Patch needs rebasing - we've redone this into bdrv_co_pwrite_zeroes and
> a byte interface, since upstream commit c1499a5e.
sure!

>>           bdrv_set_dirty_bitmap(s->dirty_bitmap, 0, end);
>>           return 0;
>>       }
>> @@ -546,6 +548,34 @@ static int mirror_dirty_init(MirrorBlockJob *s)
>>           }
>>           sector_num += n;
>>       }
>> +
>> +    if (base != NULL || bdrv_has_zero_init(target_bs)) {
> You're now repeating the conditional that used to be 'bool
> mark_all_dirty' (well, this is !mark_all_dirty); is it worth keeping the
> simpler bool around?
not quite. The difference is in the presence of the callback,
but sure, I can cache it. No problem.

>> +        /* no need to zero out entire disk */
>> +        return 0;
>> +    }
>> +
>> +    for (sector_num = 0; sector_num < end; ) {
>> +        int nb_sectors = MIN(MIRROR_ZERO_CHUNK, end - sector_num);
> Why limit yourself to 1.5G? It's either too small for what you can
> really do, or too large for what the device permits.  See my above
> comment about MIN_NON_ZERO.
alignment, covered above

>> +        int64_t now = qemu_clock_get_ns(QEMU_CLOCK_REALTIME);
>> +
>> +        if (now - last_pause_ns > SLICE_TIME) {
>> +            last_pause_ns = now;
>> +            block_job_sleep_ns(&s->common, QEMU_CLOCK_REALTIME, 0);
>> +        }
>> +
>> +        if (block_job_is_cancelled(&s->common)) {
>> +            return -EINTR;
>> +        }
>> +
>> +        if (s->in_flight == MAX_IN_FLIGHT) {
>> +            trace_mirror_yield(s, s->in_flight, s->buf_free_count, -1);
>> +            mirror_wait_for_io(s);
>> +            continue;
>> +        }
> Hmm - I guess your mirror yield points are why you couldn't just
> directly use bdrv_make_zero(); but is that something where some code
> refactoring can share more code rather than duplicating it?
the purpose is to put several requests onto the wire in parallel.
The original mirror code does this nicely and is thus reused.

>> +
>> +        mirror_do_zero_or_discard(s, sector_num, nb_sectors, false);
>> +        sector_num += nb_sectors;
>> +    }
>>       return 0;
>>   }
>>   
>>
Eric Blake June 15, 2016, 12:34 p.m. UTC | #3
On 06/15/2016 02:46 AM, Denis V. Lunev wrote:
> On 06/15/2016 06:00 AM, Eric Blake wrote:
>> On 06/14/2016 09:25 AM, Denis V. Lunev wrote:
>>> With a bdrv_co_write_zeroes method on a target BDS, zeroes will not be
>>> placed on the wire. Thus the target can be zeroed out very efficiently.
>>> This should be done with the largest chunk possible.
>>>

>> Probably nicer to track this in bytes.  And do you really want a
>> hard-coded arbitrary limit, or is it better to live with
>> MIN_NON_ZERO(target_bs->bl.max_pwrite_zeroes, INT_MAX)?
> unfortunately we should. INT_MAX is not aligned as required.
> Maybe we should align INT_MAX properly to fulfill the
> write_zeroes alignment.
> 
> Hmm, maybe we can align INT_MAX down properly. OK,
> I'll try to do that gracefully.

It's fairly easy to round a max_transfer or max_pwrite_zeroes down to an
aligned value; we already have code in io.c that does that in
bdrv_co_do_pwrite_zeroes().
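
[A condensed sketch of that rounding pattern; QEMU_ALIGN_DOWN and MIN_NON_ZERO
are real osdep.h macros, but the exact limit and alignment field names here
are assumptions modelled on bdrv_co_do_pwrite_zeroes():

    /* Clamp the request size to the device limit, then round it down
     * so full-sized chunks keep the write_zeroes alignment. */
    int max_write_zeroes = MIN_NON_ZERO(bs->bl.max_pwrite_zeroes, INT_MAX);
    int alignment = bs->bl.pwrite_zeroes_alignment;
    if (alignment) {
        max_write_zeroes = QEMU_ALIGN_DOWN(max_write_zeroes, alignment);
    }
]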

> 
>>> @@ -512,7 +513,8 @@ static int mirror_dirty_init(MirrorBlockJob *s)
>>>         end = s->bdev_length / BDRV_SECTOR_SIZE;
>>>   -    if (base == NULL && !bdrv_has_zero_init(target_bs)) {
>>> +    if (base == NULL && !bdrv_has_zero_init(target_bs) &&
>>> +            target_bs->drv->bdrv_co_write_zeroes == NULL) {
>> Indentation is off, although if checkpatch.pl doesn't complain I guess
>> it doesn't matter that much.
>>
>> Why should you care whether the target_bs->drv implements a callback?
>> Can't you just rely on the normal bdrv_*() functions to do the dirty
>> work of picking the most efficient implementation without you having to
>> bypass the block layer?  In fact, isn't that the whole goal of
>> bdrv_make_zero() - why not call that instead of reimplementing it?
> this is the idea of the patch, actually. If the callback is not
> implemented, we will have zeroes actually written or sent to the wire.
> In this case there is not much sense to do that: the amount of data
> actually written will be significantly increased (some areas will be
> written twice - with zeroes and with the actual data).
>

But that's where bdrv_can_write_zeroes_with_unmap() comes in handy - you
can use the public interface to learn whether bdrv_make_zero() will be
efficient or not, without having to probe what the backend supports.

> On the other hand, if the callback is implemented, we will have a very
> small amount of data on the wire and actually written, and thus will have
> a benefit. I am trying to avoid very small chunks of data. Here (during
> the migration process) the data is sent in 10 MB chunks, which takes a
> LOT of time with NBD. We can send 1.5 GB chunks (currently). They occupy
> the same 26 bytes on the transport layer.

I agree that we don't want to pre-initialize the device to zero unless
write zeroes is an efficient operation, but I don't think that the
existence of bs->drv->bdrv_co_[p]write_zeroes is the right way to find
that out.

I also think that we need to push harder on the NBD list that under the
new block limits proposal, we WANT to be able to advertise when the new
NBD_CMD_WRITE_ZEROES command will accept a larger size than
NBD_CMD_WRITE (as currently written, the BLOCK_INFO extension proposal
states that if a server advertises a max transaction size to the client,
then the client must honor that size for all commands including
NBD_CMD_WRITE_ZEROES, which would mean your 1.5G request [or my proposed
2G - 4k request] is invalid and would have to be a bunch of 32M requests).
https://sourceforge.net/p/nbd/mailman/message/35081223/
Denis V. Lunev June 15, 2016, 1:18 p.m. UTC | #4
On 06/15/2016 03:34 PM, Eric Blake wrote:
> On 06/15/2016 02:46 AM, Denis V. Lunev wrote:
>> On 06/15/2016 06:00 AM, Eric Blake wrote:
>>> On 06/14/2016 09:25 AM, Denis V. Lunev wrote:
>>>> With a bdrv_co_write_zeroes method on a target BDS, zeroes will not be
>>>> placed on the wire. Thus the target can be zeroed out very efficiently.
>>>> This should be done with the largest chunk possible.
>>>>
>>> Probably nicer to track this in bytes.  And do you really want a
>>> hard-coded arbitrary limit, or is it better to live with
>>> MIN_NON_ZERO(target_bs->bl.max_pwrite_zeroes, INT_MAX)?
>> unfortunately we should. INT_MAX is not aligned as required.
>> Maybe we should align INT_MAX properly to fulfill the
>> write_zeroes alignment.
>>
>> Hmm, maybe we can align INT_MAX down properly. OK,
>> I'll try to do that gracefully.
> It's fairly easy to round a max_transfer or max_pwrite_zeroes down to an
> aligned value; we already have code in io.c that does that in
> bdrv_co_do_pwrite_zeroes().

ok



>>>> @@ -512,7 +513,8 @@ static int mirror_dirty_init(MirrorBlockJob *s)
>>>>          end = s->bdev_length / BDRV_SECTOR_SIZE;
>>>>    -    if (base == NULL && !bdrv_has_zero_init(target_bs)) {
>>>> +    if (base == NULL && !bdrv_has_zero_init(target_bs) &&
>>>> +            target_bs->drv->bdrv_co_write_zeroes == NULL) {
>>> Indentation is off, although if checkpatch.pl doesn't complain I guess
>>> it doesn't matter that much.
>>>
>>> Why should you care whether the target_bs->drv implements a callback?
>>> Can't you just rely on the normal bdrv_*() functions to do the dirty
>>> work of picking the most efficient implementation without you having to
>>> bypass the block layer?  In fact, isn't that the whole goal of
>>> bdrv_make_zero() - why not call that instead of reimplementing it?
>> this is the idea of the patch, actually. If the callback is not
>> implemented, we will have zeroes actually written or sent to the wire.
>> In this case there is not much sense to do that: the amount of data
>> actually written will be significantly increased (some areas will be
>> written twice - with zeroes and with the actual data).
>>
> But that's where bdrv_can_write_zeroes_with_unmap() comes in handy - you
> can use the public interface to learn whether bdrv_make_zero() will be
> efficient or not, without having to probe what the backend supports.

bool bdrv_can_write_zeroes_with_unmap(BlockDriverState *bs)
{
     BlockDriverInfo bdi;

     if (bs->backing || !(bs->open_flags & BDRV_O_UNMAP)) {
         return false;
     }

     if (bdrv_get_info(bs, &bdi) == 0) {
         return bdi.can_write_zeroes_with_unmap;
     }

     return false;
}


This function looks rotten. We CAN efficiently zero out
QCOW2 images even with a backing store available. And conversely,
the availability of bdrv_co_write_zeroes does not guarantee
that it actually works (NFS, CIFS, etc. for raw-posix.c).




>> On the other hand, if the callback is implemented, we will have a very
>> small amount of data on the wire and actually written, and thus will have
>> a benefit. I am trying to avoid very small chunks of data. Here (during
>> the migration process) the data is sent in 10 MB chunks, which takes a
>> LOT of time with NBD. We can send 1.5 GB chunks (currently). They occupy
>> the same 26 bytes on the transport layer.
> I agree that we don't want to pre-initialize the device to zero unless
> write zeroes is an efficient operation, but I don't think that the
> existence of bs->drv->bdrv_co_[p]write_zeroes is the right way to find
> that out.
>
> I also think that we need to push harder on the NBD list that under the
> new block limits proposal, we WANT to be able to advertise when the new
> NBD_CMD_WRITE_ZEROES command will accept a larger size than
> NBD_CMD_WRITE (as currently written, the BLOCK_INFO extension proposal
> states that if a server advertises a max transaction size to the client,
> then the client must honor that size for all commands including
> NBD_CMD_WRITE_ZEROES, which would mean your 1.5G request [or my proposed
> 2G - 4k request] is invalid and would have to be a bunch of 32M requests).
> https://sourceforge.net/p/nbd/mailman/message/35081223/
>


I see...
Denis V. Lunev July 6, 2016, 2:33 p.m. UTC | #5
On 06/15/2016 03:34 PM, Eric Blake wrote:
> On 06/15/2016 02:46 AM, Denis V. Lunev wrote:
>> On 06/15/2016 06:00 AM, Eric Blake wrote:
>>> On 06/14/2016 09:25 AM, Denis V. Lunev wrote:
>>>> With a bdrv_co_write_zeroes method on a target BDS, zeroes will not be
>>>> placed on the wire. Thus the target can be zeroed out very efficiently.
>>>> This should be done with the largest chunk possible.
>>>>
>>> Probably nicer to track this in bytes.  And do you really want a
>>> hard-coded arbitrary limit, or is it better to live with
>>> MIN_NON_ZERO(target_bs->bl.max_pwrite_zeroes, INT_MAX)?
>> unfortunately we should. INT_MAX is not aligned as required.
>> Maybe we should align INT_MAX properly to fulfill the
>> write_zeroes alignment.
>>
>> Hmm, maybe we can align INT_MAX down properly. OK,
>> I'll try to do that gracefully.
> It's fairly easy to round a max_transfer or max_pwrite_zeroes down to an
> aligned value; we already have code in io.c that does that in
> bdrv_co_do_pwrite_zeroes().
>
>>>> @@ -512,7 +513,8 @@ static int mirror_dirty_init(MirrorBlockJob *s)
>>>>          end = s->bdev_length / BDRV_SECTOR_SIZE;
>>>>    -    if (base == NULL && !bdrv_has_zero_init(target_bs)) {
>>>> +    if (base == NULL && !bdrv_has_zero_init(target_bs) &&
>>>> +            target_bs->drv->bdrv_co_write_zeroes == NULL) {
>>> Indentation is off, although if checkpatch.pl doesn't complain I guess
>>> it doesn't matter that much.
>>>
>>> Why should you care whether the target_bs->drv implements a callback?
>>> Can't you just rely on the normal bdrv_*() functions to do the dirty
>>> work of picking the most efficient implementation without you having to
>>> bypass the block layer?  In fact, isn't that the whole goal of
>>> bdrv_make_zero() - why not call that instead of reimplementing it?
>> this is the idea of the patch, actually. If the callback is not
>> implemented, we will have zeroes actually written or sent to the wire.
>> In this case there is not much sense to do that: the amount of data
>> actually written will be significantly increased (some areas will be
>> written twice - with zeroes and with the actual data).
>>
> But that's where bdrv_can_write_zeroes_with_unmap() comes in handy - you
> can use the public interface to learn whether bdrv_make_zero() will be
> efficient or not, without having to probe what the backend supports.
>
>> On the other hand, if the callback is implemented, we will have a very
>> small amount of data on the wire and actually written, and thus will have
>> a benefit. I am trying to avoid very small chunks of data. Here (during
>> the migration process) the data is sent in 10 MB chunks, which takes a
>> LOT of time with NBD. We can send 1.5 GB chunks (currently). They occupy
>> the same 26 bytes on the transport layer.
> I agree that we don't want to pre-initialize the device to zero unless
> write zeroes is an efficient operation, but I don't think that the
> existence of bs->drv->bdrv_co_[p]write_zeroes is the right way to find
> that out.
>
> I also think that we need to push harder on the NBD list that under the
> new block limits proposal, we WANT to be able to advertise when the new
> NBD_CMD_WRITE_ZEROES command will accept a larger size than
> NBD_CMD_WRITE (as currently written, the BLOCK_INFO extension proposal
> states that if a server advertises a max transaction size to the client,
> then the client must honor that size for all commands including
> NBD_CMD_WRITE_ZEROES, which would mean your 1.5G request [or my proposed
> 2G - 4k request] is invalid and would have to be a bunch of 32M requests).
> https://sourceforge.net/p/nbd/mailman/message/35081223/
>

Eric, do you know how to post a reply to this letter on the list?
Maybe you can write a reply and add me in CC: there.
This change to NBD is really necessary.

Den

Patch

diff --git a/block/mirror.c b/block/mirror.c
index c7b3639..c2f8773 100644
--- a/block/mirror.c
+++ b/block/mirror.c
@@ -21,6 +21,7 @@ 
 #include "qemu/ratelimit.h"
 #include "qemu/bitmap.h"
 
+#define MIRROR_ZERO_CHUNK   (3u << (29 - BDRV_SECTOR_BITS))      /* 1.5 Gb */
 #define SLICE_TIME    100000000ULL /* ns */
 #define MAX_IN_FLIGHT 16
 #define DEFAULT_MIRROR_BUF_SIZE   (10 << 20)
@@ -512,7 +513,8 @@  static int mirror_dirty_init(MirrorBlockJob *s)
 
     end = s->bdev_length / BDRV_SECTOR_SIZE;
 
-    if (base == NULL && !bdrv_has_zero_init(target_bs)) {
+    if (base == NULL && !bdrv_has_zero_init(target_bs) &&
+            target_bs->drv->bdrv_co_write_zeroes == NULL) {
         bdrv_set_dirty_bitmap(s->dirty_bitmap, 0, end);
         return 0;
     }
@@ -546,6 +548,34 @@  static int mirror_dirty_init(MirrorBlockJob *s)
         }
         sector_num += n;
     }
+
+    if (base != NULL || bdrv_has_zero_init(target_bs)) {
+        /* no need to zero out entire disk */
+        return 0;
+    }
+
+    for (sector_num = 0; sector_num < end; ) {
+        int nb_sectors = MIN(MIRROR_ZERO_CHUNK, end - sector_num);
+        int64_t now = qemu_clock_get_ns(QEMU_CLOCK_REALTIME);
+
+        if (now - last_pause_ns > SLICE_TIME) {
+            last_pause_ns = now;
+            block_job_sleep_ns(&s->common, QEMU_CLOCK_REALTIME, 0);
+        }
+
+        if (block_job_is_cancelled(&s->common)) {
+            return -EINTR;
+        }
+
+        if (s->in_flight == MAX_IN_FLIGHT) {
+            trace_mirror_yield(s, s->in_flight, s->buf_free_count, -1);
+            mirror_wait_for_io(s);
+            continue;
+        }
+
+        mirror_do_zero_or_discard(s, sector_num, nb_sectors, false);
+        sector_num += nb_sectors;
+    }
     return 0;
 }