[8/9] mirror: use synch scheme for drive mirror

Message ID 1465917916-22348-9-git-send-email-den@openvz.org (mailing list archive)
State New, archived

Commit Message

Denis V. Lunev June 14, 2016, 3:25 p.m. UTC
Block commit of the active image to the backing store on a slow disk
could never end. For example with the guest with the following loop
inside
    while true; do
        dd bs=1k count=1 if=/dev/zero of=x
    done
running above slow storage could not complete the operation with a
resonable amount of time:
    virsh blockcommit rhel7 sda --active --shallow
    virsh qemu-monitor-event
    virsh qemu-monitor-command rhel7 \
        '{"execute":"block-job-complete",\
          "arguments":{"device":"drive-scsi0-0-0-0"} }'
    virsh qemu-monitor-event
Completion event is never received.

This problem could not be fixed easily with the current architecture. We
should either prohibit guest writes (making dirty bitmap dirty) or switch
to the sycnchronous scheme.

This patch implements the latter. It adds mirror_before_write_notify
callback. In this case all data written from the guest is synchnonously
written to the mirror target. Though the problem is solved partially.
We should switch from bdrv_dirty_bitmap to simple hbitmap. This will be
done in the next patch.

Signed-off-by: Denis V. Lunev <den@openvz.org>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
CC: Stefan Hajnoczi <stefanha@redhat.com>
CC: Fam Zheng <famz@redhat.com>
CC: Kevin Wolf <kwolf@redhat.com>
CC: Max Reitz <mreitz@redhat.com>
CC: Jeff Cody <jcody@redhat.com>
CC: Eric Blake <eblake@redhat.com>
---
 block/mirror.c | 78 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 78 insertions(+)

Comments

Eric Blake June 15, 2016, 4:18 a.m. UTC | #1
On 06/14/2016 09:25 AM, Denis V. Lunev wrote:
> Block commit of the active image to the backing store on a slow disk
> could never end. For example with the guest with the following loop
> inside
>     while true; do
>         dd bs=1k count=1 if=/dev/zero of=x
>     done
> running above slow storage could not complete the operation with a

s/with/within/

> resonable amount of time:

s/resonable/reasonable/

>     virsh blockcommit rhel7 sda --active --shallow
>     virsh qemu-monitor-event
>     virsh qemu-monitor-command rhel7 \
>         '{"execute":"block-job-complete",\
>           "arguments":{"device":"drive-scsi0-0-0-0"} }'
>     virsh qemu-monitor-event
> Completion event is never received.
> 
> This problem could not be fixed easily with the current architecture. We
> should either prohibit guest writes (making dirty bitmap dirty) or switch
> to the sycnchronous scheme.

s/sycnchronous/synchronous/

> 
> This patch implements the latter. It adds mirror_before_write_notify
> callback. In this case all data written from the guest is synchnonously

s/synchnonously/synchronously/

> written to the mirror target. Though the problem is solved partially.
> We should switch from bdrv_dirty_bitmap to simple hbitmap. This will be
> done in the next patch.
> 

In other words, the mere act of mirroring a guest will now be
guest-visible in that the guest is auto-throttled while waiting for the
mirroring to be written out.  It seems like you would want to be able to
opt in or out of this scheme.  Is it something that can be toggled
mid-operation (try asynchronous, and switch to synchronous if a timeout
elapses)?

> Signed-off-by: Denis V. Lunev <den@openvz.org>
> Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
> CC: Stefan Hajnoczi <stefanha@redhat.com>
> CC: Fam Zheng <famz@redhat.com>
> CC: Kevin Wolf <kwolf@redhat.com>
> CC: Max Reitz <mreitz@redhat.com>
> CC: Jeff Cody <jcody@redhat.com>
> CC: Eric Blake <eblake@redhat.com>
> ---
>  block/mirror.c | 78 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 78 insertions(+)
> 

I'll leave the actual idea to others to review, because there may be
some ramifications that I'm not thinking of.
Denis V. Lunev June 15, 2016, 8:52 a.m. UTC | #2
On 06/15/2016 07:18 AM, Eric Blake wrote:
> On 06/14/2016 09:25 AM, Denis V. Lunev wrote:
>> Block commit of the active image to the backing store on a slow disk
>> could never end. For example with the guest with the following loop
>> inside
>>      while true; do
>>          dd bs=1k count=1 if=/dev/zero of=x
>>      done
>> running above slow storage could not complete the operation with a
> s/with/within/
>
>> resonable amount of time:
> s/resonable/reasonable/
>
>>      virsh blockcommit rhel7 sda --active --shallow
>>      virsh qemu-monitor-event
>>      virsh qemu-monitor-command rhel7 \
>>          '{"execute":"block-job-complete",\
>>            "arguments":{"device":"drive-scsi0-0-0-0"} }'
>>      virsh qemu-monitor-event
>> Completion event is never received.
>>
>> This problem could not be fixed easily with the current architecture. We
>> should either prohibit guest writes (making dirty bitmap dirty) or switch
>> to the sycnchronous scheme.
> s/sycnchronous/synchronous/
>
>> This patch implements the latter. It adds mirror_before_write_notify
>> callback. In this case all data written from the guest is synchnonously
> s/synchnonously/synchronously/
>
>> written to the mirror target. Though the problem is solved partially.
>> We should switch from bdrv_dirty_bitmap to simple hbitmap. This will be
>> done in the next patch.
>>
> In other words, the mere act of mirroring a guest will now be
> guest-visible in that the guest is auto-throttled while waiting for the
> mirroring to be written out.  It seems like you would want to be able to
> opt in or out of this scheme.  Is it something that can be toggled
> mid-operation (try asynchronous, and switch to synchronous if a timeout
> elapses)?
>
>> Signed-off-by: Denis V. Lunev <den@openvz.org>
>> Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
>> CC: Stefan Hajnoczi <stefanha@redhat.com>
>> CC: Fam Zheng <famz@redhat.com>
>> CC: Kevin Wolf <kwolf@redhat.com>
>> CC: Max Reitz <mreitz@redhat.com>
>> CC: Jeff Cody <jcody@redhat.com>
>> CC: Eric Blake <eblake@redhat.com>
>> ---
>>   block/mirror.c | 78 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>   1 file changed, 78 insertions(+)
>>
> I'll leave the actual idea to others to review, because there may be
> some ramifications that I'm not thinking of.
>
I would like to start the discussion with this series.
Yes, maybe we need such a policy and should switch
to the synchronous scheme after the first stage of mirroring
(when the 'complete' command is sent by the management
layer).

This could be done relatively easily on the basis of these
patches. Really. Though I want to obtain some general
acceptance in advance.

Den

P.S. Thank you very much for looking at this ;)
Stefan Hajnoczi June 15, 2016, 9:48 a.m. UTC | #3
On Tue, Jun 14, 2016 at 06:25:15PM +0300, Denis V. Lunev wrote:
> Block commit of the active image to the backing store on a slow disk
> could never end. For example with the guest with the following loop
> inside
>     while true; do
>         dd bs=1k count=1 if=/dev/zero of=x
>     done
> running above slow storage could not complete the operation with a
> resonable amount of time:
>     virsh blockcommit rhel7 sda --active --shallow
>     virsh qemu-monitor-event
>     virsh qemu-monitor-command rhel7 \
>         '{"execute":"block-job-complete",\
>           "arguments":{"device":"drive-scsi0-0-0-0"} }'
>     virsh qemu-monitor-event
> Completion event is never received.
> 
> This problem could not be fixed easily with the current architecture. We
> should either prohibit guest writes (making dirty bitmap dirty) or switch
> to the sycnchronous scheme.
> 
> This patch implements the latter. It adds mirror_before_write_notify
> callback. In this case all data written from the guest is synchnonously
> written to the mirror target. Though the problem is solved partially.
> We should switch from bdrv_dirty_bitmap to simple hbitmap. This will be
> done in the next patch.
> 
> Signed-off-by: Denis V. Lunev <den@openvz.org>
> Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
> CC: Stefan Hajnoczi <stefanha@redhat.com>
> CC: Fam Zheng <famz@redhat.com>
> CC: Kevin Wolf <kwolf@redhat.com>
> CC: Max Reitz <mreitz@redhat.com>
> CC: Jeff Cody <jcody@redhat.com>
> CC: Eric Blake <eblake@redhat.com>
> ---
>  block/mirror.c | 78 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 78 insertions(+)
> 
> diff --git a/block/mirror.c b/block/mirror.c
> index 7471211..086256c 100644
> --- a/block/mirror.c
> +++ b/block/mirror.c
> @@ -58,6 +58,9 @@ typedef struct MirrorBlockJob {
>      QSIMPLEQ_HEAD(, MirrorBuffer) buf_free;
>      int buf_free_count;
>  
> +    NotifierWithReturn before_write;
> +    CoQueue dependent_writes;
> +
>      unsigned long *in_flight_bitmap;
>      int in_flight;
>      int sectors_in_flight;
> @@ -125,6 +128,7 @@ static void mirror_iteration_done(MirrorOp *op, int ret)
>      g_free(op->buf);

qemu_vfree() must be used for qemu_blockalign() memory.
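
That is, a sketch of the corrected cleanup in mirror_iteration_done(),
assuming op->buf is always allocated with qemu_try_blockalign() as in
this patch:

    qemu_vfree(op->buf);    /* blockalign'ed memory needs qemu_vfree() */
    g_free(op);             /* op itself is g_new()'ed, so g_free() stays */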

>      g_free(op);
>  
> +    qemu_co_queue_restart_all(&s->dependent_writes);
>      if (s->waiting_for_io) {
>          qemu_coroutine_enter(s->common.co, NULL);
>      }
> @@ -511,6 +515,74 @@ static void mirror_exit(BlockJob *job, void *opaque)
>      bdrv_unref(src);
>  }
>  
> +static int coroutine_fn mirror_before_write_notify(
> +        NotifierWithReturn *notifier, void *opaque)
> +{
> +    MirrorBlockJob *s = container_of(notifier, MirrorBlockJob, before_write);
> +    BdrvTrackedRequest *req = opaque;
> +    MirrorOp *op;
> +    int sectors_per_chunk = s->granularity >> BDRV_SECTOR_BITS;
> +    int64_t sector_num = req->offset >> BDRV_SECTOR_BITS;
> +    int nb_sectors = req->bytes >> BDRV_SECTOR_BITS;
> +    int64_t end_sector = sector_num + nb_sectors;
> +    int64_t aligned_start, aligned_end;
> +
> +    if (req->type != BDRV_TRACKED_DISCARD && req->type != BDRV_TRACKED_WRITE) {
> +        /* this is not discard and write, we do not care */
> +        return 0;
> +    }
> +
> +    while (1) {
> +        bool waited = false;
> +        int64_t sn;
> +
> +        for (sn = sector_num; sn < end_sector; sn += sectors_per_chunk) {
> +            int64_t chunk = sn / sectors_per_chunk;
> +            if (test_bit(chunk, s->in_flight_bitmap)) {
> +                trace_mirror_yield_in_flight(s, chunk, s->in_flight);
> +                qemu_co_queue_wait(&s->dependent_writes);
> +                waited = true;
> +            }
> +        }
> +
> +        if (!waited) {
> +            break;
> +        }
> +    }
> +
> +    aligned_start = QEMU_ALIGN_UP(sector_num, sectors_per_chunk);
> +    aligned_end = QEMU_ALIGN_DOWN(sector_num + nb_sectors, sectors_per_chunk);
> +    if (aligned_end > aligned_start) {
> +        bdrv_reset_dirty_bitmap(s->dirty_bitmap, aligned_start,
> +                                aligned_end - aligned_start);
> +    }
> +
> +    if (req->type == BDRV_TRACKED_DISCARD) {
> +        mirror_do_zero_or_discard(s, sector_num, nb_sectors, true);
> +        return 0;
> +    }
> +
> +    s->in_flight++;
> +    s->sectors_in_flight += nb_sectors;
> +
> +    /* Allocate a MirrorOp that is used as an AIO callback.  */
> +    op = g_new(MirrorOp, 1);
> +    op->s = s;
> +    op->sector_num = sector_num;
> +    op->nb_sectors = nb_sectors;
> +    op->buf = qemu_try_blockalign(blk_bs(s->target), req->qiov->size);
> +    if (op->buf == NULL) {
> +        g_free(op);
> +        return -ENOMEM;
> +    }
> +    qemu_iovec_init(&op->qiov, req->qiov->niov);
> +    qemu_iovec_clone(&op->qiov, req->qiov, op->buf);

Now op->qiov's iovec[] array is equivalent to req->qiov but points to
op->buf.  But you never copied the data from req->qiov to op->buf so
junk gets written to the target!
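
A minimal sketch of the missing step, using the existing
qemu_iovec_to_buf() helper to fill the bounce buffer before the write
is issued:

    qemu_iovec_init(&op->qiov, req->qiov->niov);
    qemu_iovec_clone(&op->qiov, req->qiov, op->buf);
    /* actually copy the guest data into the bounce buffer */
    qemu_iovec_to_buf(req->qiov, 0, op->buf, req->qiov->size);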

> +    blk_aio_pwritev(s->target, req->offset, &op->qiov, 0,
> +                    mirror_write_complete, op);
> +    return 0;
> +}

The commit message and description claim this is synchronous but it is
not.  Async requests are still being generated by guest I/O.  There is
no rate limiting if s->target is slower than bs.  In that case the
backlog of queued AIO requests keeps growing (including the bounce
buffers).  The guest will exhaust host memory, or aio functions will
fail (e.g. the Linux AIO maximum request count is reached).

If you want this to be synchronous you have to yield the coroutine until
the request completes.  Synchronous writes increase latency so this
cannot be the new default.
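
For illustration, a truly synchronous variant might issue the write
directly from the notifier and yield until it completes (a sketch,
assuming the notifier only runs in coroutine context, as its
coroutine_fn marker suggests, and that blk_co_pwritev() may be used on
s->target here; this would also make the bounce buffer unnecessary):

    int ret = blk_co_pwritev(s->target, req->offset, req->qiov->size,
                             req->qiov, 0);
    if (ret < 0) {
        return ret;    /* fail the guest write if the mirror write failed */
    }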

A different solution is to detect when the dirty bitmap reaches a
minimum threshold and then employ I/O throttling on bs.  That way the
guest experiences no vcpu/network downtime and the I/O performance only
drops during the convergence phase.
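
Hypothetically, that heuristic could look like the following inside the
mirror job's main loop (bdrv_get_dirty_count() exists; the threshold
constant and the mirror_throttle_guest() helper are made up purely for
illustration):

    int64_t dirty_sectors = bdrv_get_dirty_count(s->dirty_bitmap);

    if (dirty_sectors * BDRV_SECTOR_SIZE <= CONVERGE_THRESHOLD_BYTES) {
        /* nearly converged: throttle guest writes on bs so the
         * remaining dirty chunks can be flushed and the job completes */
        mirror_throttle_guest(s, bs);
    }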

Stefan

Patch

diff --git a/block/mirror.c b/block/mirror.c
index 7471211..086256c 100644
--- a/block/mirror.c
+++ b/block/mirror.c
@@ -58,6 +58,9 @@  typedef struct MirrorBlockJob {
     QSIMPLEQ_HEAD(, MirrorBuffer) buf_free;
     int buf_free_count;
 
+    NotifierWithReturn before_write;
+    CoQueue dependent_writes;
+
     unsigned long *in_flight_bitmap;
     int in_flight;
     int sectors_in_flight;
@@ -125,6 +128,7 @@  static void mirror_iteration_done(MirrorOp *op, int ret)
     g_free(op->buf);
     g_free(op);
 
+    qemu_co_queue_restart_all(&s->dependent_writes);
     if (s->waiting_for_io) {
         qemu_coroutine_enter(s->common.co, NULL);
     }
@@ -511,6 +515,74 @@  static void mirror_exit(BlockJob *job, void *opaque)
     bdrv_unref(src);
 }
 
+static int coroutine_fn mirror_before_write_notify(
+        NotifierWithReturn *notifier, void *opaque)
+{
+    MirrorBlockJob *s = container_of(notifier, MirrorBlockJob, before_write);
+    BdrvTrackedRequest *req = opaque;
+    MirrorOp *op;
+    int sectors_per_chunk = s->granularity >> BDRV_SECTOR_BITS;
+    int64_t sector_num = req->offset >> BDRV_SECTOR_BITS;
+    int nb_sectors = req->bytes >> BDRV_SECTOR_BITS;
+    int64_t end_sector = sector_num + nb_sectors;
+    int64_t aligned_start, aligned_end;
+
+    if (req->type != BDRV_TRACKED_DISCARD && req->type != BDRV_TRACKED_WRITE) {
+        /* this is not discard and write, we do not care */
+        return 0;
+    }
+
+    while (1) {
+        bool waited = false;
+        int64_t sn;
+
+        for (sn = sector_num; sn < end_sector; sn += sectors_per_chunk) {
+            int64_t chunk = sn / sectors_per_chunk;
+            if (test_bit(chunk, s->in_flight_bitmap)) {
+                trace_mirror_yield_in_flight(s, chunk, s->in_flight);
+                qemu_co_queue_wait(&s->dependent_writes);
+                waited = true;
+            }
+        }
+
+        if (!waited) {
+            break;
+        }
+    }
+
+    aligned_start = QEMU_ALIGN_UP(sector_num, sectors_per_chunk);
+    aligned_end = QEMU_ALIGN_DOWN(sector_num + nb_sectors, sectors_per_chunk);
+    if (aligned_end > aligned_start) {
+        bdrv_reset_dirty_bitmap(s->dirty_bitmap, aligned_start,
+                                aligned_end - aligned_start);
+    }
+
+    if (req->type == BDRV_TRACKED_DISCARD) {
+        mirror_do_zero_or_discard(s, sector_num, nb_sectors, true);
+        return 0;
+    }
+
+    s->in_flight++;
+    s->sectors_in_flight += nb_sectors;
+
+    /* Allocate a MirrorOp that is used as an AIO callback.  */
+    op = g_new(MirrorOp, 1);
+    op->s = s;
+    op->sector_num = sector_num;
+    op->nb_sectors = nb_sectors;
+    op->buf = qemu_try_blockalign(blk_bs(s->target), req->qiov->size);
+    if (op->buf == NULL) {
+        g_free(op);
+        return -ENOMEM;
+    }
+    qemu_iovec_init(&op->qiov, req->qiov->niov);
+    qemu_iovec_clone(&op->qiov, req->qiov, op->buf);
+
+    blk_aio_pwritev(s->target, req->offset, &op->qiov, 0,
+                    mirror_write_complete, op);
+    return 0;
+}
+
 static int mirror_dirty_init(MirrorBlockJob *s)
 {
     int64_t sector_num, end;
@@ -764,6 +836,8 @@  immediate_exit:
         mirror_drain(s);
     }
 
+    notifier_with_return_remove(&s->before_write);
+
     assert(s->in_flight == 0);
     qemu_vfree(s->buf);
     g_free(s->cow_bitmap);
@@ -905,6 +979,10 @@  static void mirror_start_job(BlockDriverState *bs, BlockDriverState *target,
         return;
     }
 
+    qemu_co_queue_init(&s->dependent_writes);
+    s->before_write.notify = mirror_before_write_notify;
+    bdrv_add_before_write_notifier(bs, &s->before_write);
+
     bdrv_op_block_all(target, s->common.blocker);
 
     s->common.co = qemu_coroutine_create(mirror_run);