
[7/6] block/io: improve loadvm performance

Message ID 20200622083303.18665-1-den@openvz.org (mailing list archive)
State New, archived
Series: block: seriously improve savevm performance

Commit Message

Denis V. Lunev June 22, 2020, 8:33 a.m. UTC
This patch creates an intermediate buffer for reading from the block driver
state and performs read-ahead into this buffer. Snapshot code performs
reads sequentially, so we know in advance which offsets will be required
and when they are no longer needed.

Results are fantastic. Times to switch to a snapshot of a 2 GB Fedora 31 VM
on NVMe storage are the following:
                original     fixed
cached:          1.84s       1.16s
non-cached:     12.74s       1.27s

The difference over HDD would be more significant :)

Signed-off-by: Denis V. Lunev <den@openvz.org>
CC: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
CC: Kevin Wolf <kwolf@redhat.com>
CC: Max Reitz <mreitz@redhat.com>
CC: Stefan Hajnoczi <stefanha@redhat.com>
CC: Fam Zheng <fam@euphon.net>
CC: Juan Quintela <quintela@redhat.com>
CC: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
CC: Denis Plotnikov <dplotnikov@virtuozzo.com>
---

 block/io.c                | 225 +++++++++++++++++++++++++++++++++++++-
 include/block/block_int.h |   3 +
 2 files changed, 225 insertions(+), 3 deletions(-)

Comments

Vladimir Sementsov-Ogievskiy June 22, 2020, 5:02 p.m. UTC | #1
22.06.2020 11:33, Denis V. Lunev wrote:
> This patch creates an intermediate buffer for reading from the block driver
> state and performs read-ahead into this buffer. Snapshot code performs
> reads sequentially, so we know in advance which offsets will be required
> and when they are no longer needed.
> 
> Results are fantastic. Times to switch to a snapshot of a 2 GB Fedora 31 VM
> on NVMe storage are the following:
>                  original     fixed
> cached:          1.84s       1.16s
> non-cached:     12.74s       1.27s
> 
> The difference over HDD would be more significant :)
> 
> Signed-off-by: Denis V. Lunev <den@openvz.org>
> CC: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
> CC: Kevin Wolf <kwolf@redhat.com>
> CC: Max Reitz <mreitz@redhat.com>
> CC: Stefan Hajnoczi <stefanha@redhat.com>
> CC: Fam Zheng <fam@euphon.net>
> CC: Juan Quintela <quintela@redhat.com>
> CC: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> CC: Denis Plotnikov <dplotnikov@virtuozzo.com>
> ---
> 
>   block/io.c                | 225 +++++++++++++++++++++++++++++++++++++-
>   include/block/block_int.h |   3 +
>   2 files changed, 225 insertions(+), 3 deletions(-)
> 
> diff --git a/block/io.c b/block/io.c
> index 71a696deb7..bb06f750d8 100644
> --- a/block/io.c
> +++ b/block/io.c
> @@ -2739,6 +2739,180 @@ static int bdrv_co_do_save_vmstate(BlockDriverState *bs, QEMUIOVector *qiov,
>       }
>   }
>   
> +
> +typedef struct BdrvLoadVMChunk {
> +    void *buf;
> +    uint64_t offset;
> +    ssize_t bytes;
> +
> +    QLIST_ENTRY(BdrvLoadVMChunk) list;
> +} BdrvLoadVMChunk;
> +
> +typedef struct BdrvLoadVMState {
> +    AioTaskPool *pool;
> +
> +    int64_t offset;
> +    int64_t last_loaded;
> +
> +    int chunk_count;
> +    QLIST_HEAD(, BdrvLoadVMChunk) chunks;
> +    QLIST_HEAD(, BdrvLoadVMChunk) loading;
> +    CoMutex lock;
> +    CoQueue waiters;
> +} BdrvLoadVMState;
> +
> +typedef struct BdrvLoadVMStateTask {
> +    AioTask task;
> +
> +    BlockDriverState *bs;
> +    BdrvLoadVMChunk *chunk;
> +} BdrvLoadVMStateTask;
> +
> +static BdrvLoadVMChunk *bdrv_co_find_loadvmstate_chunk(int64_t pos,
> +                                                       BdrvLoadVMChunk *c)
> +{
> +    for (; c != NULL; c = QLIST_NEXT(c, list)) {
> +        if (c->offset <= pos && c->offset + c->bytes > pos) {
> +            return c;
> +        }
> +    }
> +
> +    return NULL;
> +}
> +
> +static void bdrv_free_loadvm_chunk(BdrvLoadVMChunk *c)
> +{
> +    qemu_vfree(c->buf);
> +    g_free(c);
> +}
> +
> +static coroutine_fn int bdrv_co_vmstate_load_task_entry(AioTask *task)
> +{
> +    int err = 0;
> +    BdrvLoadVMStateTask *t = container_of(task, BdrvLoadVMStateTask, task);
> +    BdrvLoadVMChunk *c = t->chunk;
> +    BdrvLoadVMState *state = t->bs->loadvm_state;
> +    QEMUIOVector qiov = QEMU_IOVEC_INIT_BUF(qiov, c->buf, c->bytes);
> +
> +    bdrv_inc_in_flight(t->bs);
> +    err = t->bs->drv->bdrv_load_vmstate(t->bs, &qiov, c->offset);
> +    bdrv_dec_in_flight(t->bs);
> +
> +    qemu_co_mutex_lock(&state->lock);
> +    QLIST_REMOVE(c, list);
> +    if (err == 0) {
> +        QLIST_INSERT_HEAD(&state->chunks, c, list);
> +    } else {
> +        bdrv_free_loadvm_chunk(c);
> +    }
> +    qemu_co_mutex_unlock(&state->lock);
> +    qemu_co_queue_restart_all(&state->waiters);
> +
> +    return err;
> +}
> +
> +static void bdrv_co_start_loadvmstate(BlockDriverState *bs,
> +                                      BdrvLoadVMState *state)
> +{
> +    int i;
> +    size_t buf_size = MAX(bdrv_get_cluster_size(bs), 1 * MiB);
> +
> +    qemu_co_mutex_assert_locked(&state->lock);
> +    for (i = state->chunk_count; i < BDRV_VMSTATE_WORKERS_MAX; i++) {
> +        BdrvLoadVMStateTask *t = g_new(BdrvLoadVMStateTask, 1);
> +
> +        *t = (BdrvLoadVMStateTask) {
> +            .task.func = bdrv_co_vmstate_load_task_entry,
> +            .bs = bs,
> +            .chunk = g_new(BdrvLoadVMChunk, 1),
> +        };
> +
> +        *t->chunk = (BdrvLoadVMChunk) {
> +            .buf = qemu_blockalign(bs, buf_size),
> +            .offset = state->last_loaded,
> +            .bytes = buf_size,
> +        };
> +        /* FIXME: tail of stream */
> +
> +        QLIST_INSERT_HEAD(&state->loading, t->chunk, list);
> +        state->chunk_count++;
> +        state->last_loaded += buf_size;
> +
> +        qemu_co_mutex_unlock(&state->lock);
> +        aio_task_pool_start_task(state->pool, &t->task);

Hmm, so, we unlock the mutex here. Am I missing something, or if there are
two load-state requests running in parallel (yes, this shouldn't happen
with the current migration code, but if we didn't care about that at all,
the mutex would not be needed),

we may have two such loops in parallel, which will create more than
BDRV_VMSTATE_WORKERS_MAX chunks, as each has its own local "i" variable?
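
Something like this sketch (untested, just an illustration) would close it:
drop the loop-local counter and re-check the shared chunk_count under the
lock on every iteration:

static void bdrv_co_start_loadvmstate(BlockDriverState *bs,
                                      BdrvLoadVMState *state)
{
    size_t buf_size = MAX(bdrv_get_cluster_size(bs), 1 * MiB);

    qemu_co_mutex_assert_locked(&state->lock);
    /* chunk_count is re-read under the lock on each iteration, so two
     * parallel callers together never exceed BDRV_VMSTATE_WORKERS_MAX */
    while (state->chunk_count < BDRV_VMSTATE_WORKERS_MAX) {
        BdrvLoadVMStateTask *t = g_new(BdrvLoadVMStateTask, 1);

        *t = (BdrvLoadVMStateTask) {
            .task.func = bdrv_co_vmstate_load_task_entry,
            .bs = bs,
            .chunk = g_new(BdrvLoadVMChunk, 1),
        };
        *t->chunk = (BdrvLoadVMChunk) {
            .buf = qemu_blockalign(bs, buf_size),
            .offset = state->last_loaded,
            .bytes = buf_size,
        };

        QLIST_INSERT_HEAD(&state->loading, t->chunk, list);
        state->chunk_count++;
        state->last_loaded += buf_size;

        qemu_co_mutex_unlock(&state->lock);
        aio_task_pool_start_task(state->pool, &t->task);
        qemu_co_mutex_lock(&state->lock);
    }
}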

> +        qemu_co_mutex_lock(&state->lock);
> +    }
> +}
> +
> +static int bdrv_co_do_load_vmstate(BlockDriverState *bs, QEMUIOVector *qiov,
> +                                   int64_t pos)
> +{
> +    BdrvLoadVMState *state = bs->loadvm_state;
> +    BdrvLoadVMChunk *c;
> +    size_t off;
> +
> +    if (state == NULL) {
> +        if (pos != 0) {
> +            /* Normally this branch is not reachable from migration */
> +            return bs->drv->bdrv_load_vmstate(bs, qiov, pos);
> +        }
> +
> +        state = g_new(BdrvLoadVMState, 1);
> +        *state = (BdrvLoadVMState) {
> +            .pool = aio_task_pool_new(BDRV_VMSTATE_WORKERS_MAX),
> +            .chunks = QLIST_HEAD_INITIALIZER(state->chunks),
> +            .loading = QLIST_HEAD_INITIALIZER(state->loading),
> +        };
> +        qemu_co_mutex_init(&state->lock);
> +        qemu_co_queue_init(&state->waiters);
> +
> +        bs->loadvm_state = state;
> +    }
> +
> +    if (state->offset != pos) {
> +        /* Normally this branch is not reachable from migration */
> +        return bs->drv->bdrv_load_vmstate(bs, qiov, pos);
> +    }
> +
> +    off = 0;
> +    qemu_co_mutex_lock(&state->lock);
> +    bdrv_co_start_loadvmstate(bs, state);
> +
> +    while (off < qiov->size && aio_task_pool_status(state->pool) == 0) {
> +        c = bdrv_co_find_loadvmstate_chunk(pos, QLIST_FIRST(&state->chunks));
> +        if (c != NULL) {
> +            ssize_t chunk_off = pos - c->offset;
> +            ssize_t to_copy = MIN(qiov->size - off, c->bytes - chunk_off);
> +
> +            qemu_iovec_from_buf(qiov, off, c->buf + chunk_off, to_copy);
> +
> +            off += to_copy;
> +            pos += to_copy;
> +
> +            if (pos == c->offset + c->bytes) {
> +                state->chunk_count--;
> +                /* End of buffer, discard it from the list */
> +                QLIST_REMOVE(c, list);
> +                bdrv_free_loadvm_chunk(c);
> +            }
> +
> +            state->offset += to_copy;
> +            continue;
> +        }
> +
> +        c = bdrv_co_find_loadvmstate_chunk(pos, QLIST_FIRST(&state->loading));
> +        if (c != NULL) {
> +            qemu_co_queue_wait(&state->waiters, &state->lock);
> +            continue;
> +        }
> +
> +        bdrv_co_start_loadvmstate(bs, state);

Hmm, so, you assume that if we are in the loop, we'll definitely find our chunk at some
point. That should work with a normal migration stream, which reads sequentially chunk by
chunk. But if not, it may hang:

Consider two reads at the same position, running in parallel: they both pass the
"state->offset != pos" check and wait on the mutex. Then the first one enters the critical
section, finds and removes the corresponding chunk. Then the second enters the critical
section and loops forever, because the corresponding chunk will never be generated again.

So it seems we should check (state->offset == pos) on each iteration, or something like this.
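
E.g. something like this (sketch only) at the top of the loop body in
bdrv_co_do_load_vmstate(); pos is advanced together with state->offset, so for the
legitimate reader the check stays true, while a raced reader bails out instead of
spinning (the error code here is arbitrary):

        if (state->offset != pos) {
            /* Another reader raced ahead of us: the chunk covering pos
             * was already consumed and will never be generated again. */
            qemu_co_mutex_unlock(&state->lock);
            return -EIO;
        }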

====

Another idea that came to mind: you create new requests only when we can't find our
chunk. It may be a bit more optimal to create a new request as soon as we get a free
slot, right after the QLIST_REMOVE(..).

And if we do so, we don't need the "loading" list: if there is no corresponding chunk
in "chunks", we will just wait in the "waiters" queue.
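
I.e. roughly (sketch), in the consuming loop:

            if (pos == c->offset + c->bytes) {
                state->chunk_count--;
                /* End of buffer, discard it from the list */
                QLIST_REMOVE(c, list);
                bdrv_free_loadvm_chunk(c);
                /* A worker slot is free again: queue the next read-ahead
                 * right away instead of waiting for a lookup to miss. */
                bdrv_co_start_loadvmstate(bs, state);
            }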



> +    }
> +    qemu_co_mutex_unlock(&state->lock);
> +
> +    return aio_task_pool_status(state->pool);
> +}
> +
>   static int coroutine_fn
>   bdrv_co_rw_vmstate(BlockDriverState *bs, QEMUIOVector *qiov, int64_t pos,
>                      bool is_read)
> @@ -2752,7 +2926,7 @@ bdrv_co_rw_vmstate(BlockDriverState *bs, QEMUIOVector *qiov, int64_t pos,
>           ret = -ENOMEDIUM;
>       } else if (drv->bdrv_load_vmstate) {
>           if (is_read) {
> -            ret = drv->bdrv_load_vmstate(bs, qiov, pos);
> +            ret = bdrv_co_do_load_vmstate(bs, qiov, pos);
>           } else {
>               ret = bdrv_co_do_save_vmstate(bs, qiov, pos);
>           }
> @@ -2823,13 +2997,13 @@ int bdrv_readv_vmstate(BlockDriverState *bs, QEMUIOVector *qiov, int64_t pos)
>       return bdrv_rw_vmstate(bs, qiov, pos, true);
>   }
>   
> -static int coroutine_fn bdrv_co_finalize_vmstate(BlockDriverState *bs)
> +static int coroutine_fn bdrv_co_finalize_save_vmstate(BlockDriverState *bs)
>   {
>       int err;
>       BdrvSaveVMState *state = bs->savevm_state;
>   
>       if (bs->drv->bdrv_save_vmstate == NULL && bs->file != NULL) {
> -        return bdrv_co_finalize_vmstate(bs->file->bs);
> +        return bdrv_co_finalize_save_vmstate(bs->file->bs);
>       }
>       if (state == NULL) {
>           return 0;
> @@ -2851,6 +3025,51 @@ static int coroutine_fn bdrv_co_finalize_vmstate(BlockDriverState *bs)
>       return err;
>   }
>   
> +static int coroutine_fn bdrv_co_finalize_load_vmstate(BlockDriverState *bs)
> +{
> +    int err;
> +    BdrvLoadVMState *state = bs->loadvm_state;
> +    BdrvLoadVMChunk *c, *tmp;
> +
> +    if (bs->drv->bdrv_load_vmstate == NULL && bs->file != NULL) {
> +        return bdrv_co_finalize_load_vmstate(bs->file->bs);
> +    }
> +    if (state == NULL) {
> +        return 0;
> +    }
> +
> +    aio_task_pool_wait_all(state->pool);
> +    err = aio_task_pool_status(state->pool);
> +    aio_task_pool_free(state->pool);
> +
> +    QLIST_FOREACH(c, &state->loading, list) {
> +        assert(1);  /* this list must be empty as all tasks are committed */

assert(1) never fails; you mean abort()...

or better, assert(QLIST_EMPTY(&state->loading))
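
I.e.:

    aio_task_pool_wait_all(state->pool);
    err = aio_task_pool_status(state->pool);
    aio_task_pool_free(state->pool);

    /* all tasks have completed, so nothing may be left in flight */
    assert(QLIST_EMPTY(&state->loading));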

> +    }
> +    QLIST_FOREACH_SAFE(c, &state->chunks, list, tmp) {
> +        QLIST_REMOVE(c, list);
> +        bdrv_free_loadvm_chunk(c);
> +    }
> +
> +    g_free(state);
> +
> +    bs->loadvm_state = NULL;
> +
> +    return err;
> +}
> +
> +static int coroutine_fn bdrv_co_finalize_vmstate(BlockDriverState *bs)
> +{
> +    int err1 = bdrv_co_finalize_save_vmstate(bs);
> +    int err2 = bdrv_co_finalize_load_vmstate(bs);
> +    if (err1 < 0) {
> +        return err1;
> +    }
> +    if (err2 < 0) {
> +        return err2;
> +    }
> +    return 0;
> +}
> +
>   static int coroutine_fn bdrv_finalize_vmstate_co_entry(void *opaque)
>   {
>       return bdrv_co_finalize_vmstate(opaque);
> diff --git a/include/block/block_int.h b/include/block/block_int.h
> index f90f0e8b6a..0942578a74 100644
> --- a/include/block/block_int.h
> +++ b/include/block/block_int.h
> @@ -788,6 +788,7 @@ struct BdrvChild {
>   
>   
>   typedef struct BdrvSaveVMState BdrvSaveVMState;
> +typedef struct BdrvLoadVMState BdrvLoadVMState;
>   
>   /*
>    * Note: the function bdrv_append() copies and swaps contents of
> @@ -955,6 +956,8 @@ struct BlockDriverState {
>   
>       /* Intermediate buffer for VM state saving from snapshot creation code */
>       BdrvSaveVMState *savevm_state;
> +    /* Intermediate buffer for VM state loading */
> +    BdrvLoadVMState *loadvm_state;
>   };
>   
>   struct BlockBackendRootState {
> 

Also, finalize needs to be called from blk_load_vmstate(), like it's done for blk_save_vmstate().
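
E.g. something like this sketch (the error-only cleanup mirrors my reading of the
save path; the exact trigger point is an assumption):

int blk_load_vmstate(BlockBackend *blk, uint8_t *buf, int64_t pos, int size)
{
    int ret = bdrv_load_vmstate(blk_bs(blk), buf, pos, size);

    if (ret < 0) {
        /* sketch: drop the read-ahead state on error, the same way
         * blk_save_vmstate() finalizes its intermediate buffer */
        bdrv_finalize_vmstate(blk_bs(blk));
    }

    return ret;
}
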
Denis V. Lunev June 22, 2020, 5:17 p.m. UTC | #2
On 6/22/20 8:02 PM, Vladimir Sementsov-Ogievskiy wrote:
> 22.06.2020 11:33, Denis V. Lunev wrote:
>> This patch creates an intermediate buffer for reading from the block driver
>> state and performs read-ahead into this buffer. Snapshot code performs
>> reads sequentially, so we know in advance which offsets will be required
>> and when they are no longer needed.
>>
>> Results are fantastic. Times to switch to a snapshot of a 2 GB Fedora 31 VM
>> on NVMe storage are the following:
>>                  original     fixed
>> cached:          1.84s       1.16s
>> non-cached:     12.74s       1.27s
>>
>> The difference over HDD would be more significant :)
>>
>> Signed-off-by: Denis V. Lunev <den@openvz.org>
>> CC: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
>> CC: Kevin Wolf <kwolf@redhat.com>
>> CC: Max Reitz <mreitz@redhat.com>
>> CC: Stefan Hajnoczi <stefanha@redhat.com>
>> CC: Fam Zheng <fam@euphon.net>
>> CC: Juan Quintela <quintela@redhat.com>
>> CC: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
>> CC: Denis Plotnikov <dplotnikov@virtuozzo.com>
>> ---
>>
>>   block/io.c                | 225 +++++++++++++++++++++++++++++++++++++-
>>   include/block/block_int.h |   3 +
>>   2 files changed, 225 insertions(+), 3 deletions(-)
>>
>> diff --git a/block/io.c b/block/io.c
>> index 71a696deb7..bb06f750d8 100644
>> --- a/block/io.c
>> +++ b/block/io.c
>> @@ -2739,6 +2739,180 @@ static int bdrv_co_do_save_vmstate(BlockDriverState *bs, QEMUIOVector *qiov,
>>       }
>>   }
>>   +
>> +typedef struct BdrvLoadVMChunk {
>> +    void *buf;
>> +    uint64_t offset;
>> +    ssize_t bytes;
>> +
>> +    QLIST_ENTRY(BdrvLoadVMChunk) list;
>> +} BdrvLoadVMChunk;
>> +
>> +typedef struct BdrvLoadVMState {
>> +    AioTaskPool *pool;
>> +
>> +    int64_t offset;
>> +    int64_t last_loaded;
>> +
>> +    int chunk_count;
>> +    QLIST_HEAD(, BdrvLoadVMChunk) chunks;
>> +    QLIST_HEAD(, BdrvLoadVMChunk) loading;
>> +    CoMutex lock;
>> +    CoQueue waiters;
>> +} BdrvLoadVMState;
>> +
>> +typedef struct BdrvLoadVMStateTask {
>> +    AioTask task;
>> +
>> +    BlockDriverState *bs;
>> +    BdrvLoadVMChunk *chunk;
>> +} BdrvLoadVMStateTask;
>> +
>> +static BdrvLoadVMChunk *bdrv_co_find_loadvmstate_chunk(int64_t pos,
>> +                                                       BdrvLoadVMChunk *c)
>> +{
>> +    for (; c != NULL; c = QLIST_NEXT(c, list)) {
>> +        if (c->offset <= pos && c->offset + c->bytes > pos) {
>> +            return c;
>> +        }
>> +    }
>> +
>> +    return NULL;
>> +}
>> +
>> +static void bdrv_free_loadvm_chunk(BdrvLoadVMChunk *c)
>> +{
>> +    qemu_vfree(c->buf);
>> +    g_free(c);
>> +}
>> +
>> +static coroutine_fn int bdrv_co_vmstate_load_task_entry(AioTask *task)
>> +{
>> +    int err = 0;
>> +    BdrvLoadVMStateTask *t = container_of(task, BdrvLoadVMStateTask, task);
>> +    BdrvLoadVMChunk *c = t->chunk;
>> +    BdrvLoadVMState *state = t->bs->loadvm_state;
>> +    QEMUIOVector qiov = QEMU_IOVEC_INIT_BUF(qiov, c->buf, c->bytes);
>> +
>> +    bdrv_inc_in_flight(t->bs);
>> +    err = t->bs->drv->bdrv_load_vmstate(t->bs, &qiov, c->offset);
>> +    bdrv_dec_in_flight(t->bs);
>> +
>> +    qemu_co_mutex_lock(&state->lock);
>> +    QLIST_REMOVE(c, list);
>> +    if (err == 0) {
>> +        QLIST_INSERT_HEAD(&state->chunks, c, list);
>> +    } else {
>> +        bdrv_free_loadvm_chunk(c);
>> +    }
>> +    qemu_co_mutex_unlock(&state->lock);
>> +    qemu_co_queue_restart_all(&state->waiters);
>> +
>> +    return err;
>> +}
>> +
>> +static void bdrv_co_start_loadvmstate(BlockDriverState *bs,
>> +                                      BdrvLoadVMState *state)
>> +{
>> +    int i;
>> +    size_t buf_size = MAX(bdrv_get_cluster_size(bs), 1 * MiB);
>> +
>> +    qemu_co_mutex_assert_locked(&state->lock);
>> +    for (i = state->chunk_count; i < BDRV_VMSTATE_WORKERS_MAX; i++) {
>> +        BdrvLoadVMStateTask *t = g_new(BdrvLoadVMStateTask, 1);
>> +
>> +        *t = (BdrvLoadVMStateTask) {
>> +            .task.func = bdrv_co_vmstate_load_task_entry,
>> +            .bs = bs,
>> +            .chunk = g_new(BdrvLoadVMChunk, 1),
>> +        };
>> +
>> +        *t->chunk = (BdrvLoadVMChunk) {
>> +            .buf = qemu_blockalign(bs, buf_size),
>> +            .offset = state->last_loaded,
>> +            .bytes = buf_size,
>> +        };
>> +        /* FIXME: tail of stream */
>> +
>> +        QLIST_INSERT_HEAD(&state->loading, t->chunk, list);
>> +        state->chunk_count++;
>> +        state->last_loaded += buf_size;
>> +
>> +        qemu_co_mutex_unlock(&state->lock);
>> +        aio_task_pool_start_task(state->pool, &t->task);
>
> Hmm, so, we unlock the mutex here. Am I missing something, or if there
> are two load-state requests running in parallel (yes, this shouldn't
> happen with the current migration code, but if we didn't care about
> that at all, the mutex would not be needed),
>
> we may have two such loops in parallel, which will create more than
> BDRV_VMSTATE_WORKERS_MAX chunks, as each has its own local "i" variable?

We could (technically) stall waiting for a free slot in the pool
while the pool completion needs to take the lock. This should
NOT happen with the current architecture, but it would be better
to do things correctly, i.e. drop the lock.

>
>> +        qemu_co_mutex_lock(&state->lock);
>> +    }
>> +}
>> +
>> +static int bdrv_co_do_load_vmstate(BlockDriverState *bs, QEMUIOVector *qiov,
>> +                                   int64_t pos)
>> +{
>> +    BdrvLoadVMState *state = bs->loadvm_state;
>> +    BdrvLoadVMChunk *c;
>> +    size_t off;
>> +
>> +    if (state == NULL) {
>> +        if (pos != 0) {
>> +            /* Normally this branch is not reachable from migration */
>> +            return bs->drv->bdrv_load_vmstate(bs, qiov, pos);
>> +        }
>> +
>> +        state = g_new(BdrvLoadVMState, 1);
>> +        *state = (BdrvLoadVMState) {
>> +            .pool = aio_task_pool_new(BDRV_VMSTATE_WORKERS_MAX),
>> +            .chunks = QLIST_HEAD_INITIALIZER(state->chunks),
>> +            .loading = QLIST_HEAD_INITIALIZER(state->loading),
>> +        };
>> +        qemu_co_mutex_init(&state->lock);
>> +        qemu_co_queue_init(&state->waiters);
>> +
>> +        bs->loadvm_state = state;
>> +    }
>> +
>> +    if (state->offset != pos) {
>> +        /* Normally this branch is not reachable from migration */
>> +        return bs->drv->bdrv_load_vmstate(bs, qiov, pos);
>> +    }
>> +
>> +    off = 0;
>> +    qemu_co_mutex_lock(&state->lock);
>> +    bdrv_co_start_loadvmstate(bs, state);
>> +
>> +    while (off < qiov->size && aio_task_pool_status(state->pool) == 0) {
>> +        c = bdrv_co_find_loadvmstate_chunk(pos, QLIST_FIRST(&state->chunks));
>> +        if (c != NULL) {
>> +            ssize_t chunk_off = pos - c->offset;
>> +            ssize_t to_copy = MIN(qiov->size - off, c->bytes - chunk_off);
>> +
>> +            qemu_iovec_from_buf(qiov, off, c->buf + chunk_off, to_copy);
>> +
>> +            off += to_copy;
>> +            pos += to_copy;
>> +
>> +            if (pos == c->offset + c->bytes) {
>> +                state->chunk_count--;
>> +                /* End of buffer, discard it from the list */
>> +                QLIST_REMOVE(c, list);
>> +                bdrv_free_loadvm_chunk(c);
>> +            }
>> +
>> +            state->offset += to_copy;
>> +            continue;
>> +        }
>> +
>> +        c = bdrv_co_find_loadvmstate_chunk(pos, QLIST_FIRST(&state->loading));
>> +        if (c != NULL) {
>> +            qemu_co_queue_wait(&state->waiters, &state->lock);
>> +            continue;
>> +        }
>> +
>> +        bdrv_co_start_loadvmstate(bs, state);
>
> Hmm, so, you assume that if we are in the loop, we'll definitely find
> our chunk at some point. That should work with a normal migration
> stream, which reads sequentially chunk by chunk. But if not, it may hang:
>
> Consider two reads at the same position, running in parallel: they both
> pass the "state->offset != pos" check and wait on the mutex. Then the
> first one enters the critical section, finds and removes the
> corresponding chunk. Then the second enters the critical section and
> loops forever, because the corresponding chunk will never be generated
> again.
>
> So it seems we should check (state->offset == pos) on each iteration,
> or something like this.
>
We can. But the reader is synchronous and cannot issue two requests
in parallel, so this is not a problem.


> ====
>
> Another idea that came to mind: you create new requests only when we
> can't find our chunk. It may be a bit more optimal to create a new
> request as soon as we get a free slot, right after the QLIST_REMOVE(..).
>
> And if we do so, we don't need the "loading" list: if there is no
> corresponding chunk in "chunks", we will just wait in the "waiters"
> queue.
>
No, I would prefer to keep all requests in lists.
This would save a ton of time if something
goes wrong.

But the idea of refilling near the REMOVE is fine to
work with.

>
>
>> +    }
>> +    qemu_co_mutex_unlock(&state->lock);
>> +
>> +    return aio_task_pool_status(state->pool);
>> +}
>> +
>>   static int coroutine_fn
>>   bdrv_co_rw_vmstate(BlockDriverState *bs, QEMUIOVector *qiov, int64_t pos,
>>                      bool is_read)
>> @@ -2752,7 +2926,7 @@ bdrv_co_rw_vmstate(BlockDriverState *bs, QEMUIOVector *qiov, int64_t pos,
>>           ret = -ENOMEDIUM;
>>       } else if (drv->bdrv_load_vmstate) {
>>           if (is_read) {
>> -            ret = drv->bdrv_load_vmstate(bs, qiov, pos);
>> +            ret = bdrv_co_do_load_vmstate(bs, qiov, pos);
>>           } else {
>>               ret = bdrv_co_do_save_vmstate(bs, qiov, pos);
>>           }
>> @@ -2823,13 +2997,13 @@ int bdrv_readv_vmstate(BlockDriverState *bs, QEMUIOVector *qiov, int64_t pos)
>>       return bdrv_rw_vmstate(bs, qiov, pos, true);
>>   }
>> -static int coroutine_fn bdrv_co_finalize_vmstate(BlockDriverState *bs)
>> +static int coroutine_fn bdrv_co_finalize_save_vmstate(BlockDriverState *bs)
>>   {
>>       int err;
>>       BdrvSaveVMState *state = bs->savevm_state;
>>         if (bs->drv->bdrv_save_vmstate == NULL && bs->file != NULL) {
>> -        return bdrv_co_finalize_vmstate(bs->file->bs);
>> +        return bdrv_co_finalize_save_vmstate(bs->file->bs);
>>       }
>>       if (state == NULL) {
>>           return 0;
>> @@ -2851,6 +3025,51 @@ static int coroutine_fn bdrv_co_finalize_vmstate(BlockDriverState *bs)
>>       return err;
>>   }
>> +static int coroutine_fn bdrv_co_finalize_load_vmstate(BlockDriverState *bs)
>> +{
>> +    int err;
>> +    BdrvLoadVMState *state = bs->loadvm_state;
>> +    BdrvLoadVMChunk *c, *tmp;
>> +
>> +    if (bs->drv->bdrv_load_vmstate == NULL && bs->file != NULL) {
>> +        return bdrv_co_finalize_load_vmstate(bs->file->bs);
>> +    }
>> +    if (state == NULL) {
>> +        return 0;
>> +    }
>> +
>> +    aio_task_pool_wait_all(state->pool);
>> +    err = aio_task_pool_status(state->pool);
>> +    aio_task_pool_free(state->pool);
>> +
>> +    QLIST_FOREACH(c, &state->loading, list) {
>> +        assert(1);  /* this list must be empty as all tasks are committed */
>
> assert(1) never fails; you mean abort()...
>
> or better, assert(QLIST_EMPTY(&state->loading))
>
yep!

>> +    }
>> +    QLIST_FOREACH_SAFE(c, &state->chunks, list, tmp) {
>> +        QLIST_REMOVE(c, list);
>> +        bdrv_free_loadvm_chunk(c);
>> +    }
>> +
>> +    g_free(state);
>> +
>> +    bs->loadvm_state = NULL;
>> +
>> +    return err;
>> +}
>> +
>> +static int coroutine_fn bdrv_co_finalize_vmstate(BlockDriverState *bs)
>> +{
>> +    int err1 = bdrv_co_finalize_save_vmstate(bs);
>> +    int err2 = bdrv_co_finalize_load_vmstate(bs);
>> +    if (err1 < 0) {
>> +        return err1;
>> +    }
>> +    if (err2 < 0) {
>> +        return err2;
>> +    }
>> +    return 0;
>> +}
>> +
>>   static int coroutine_fn bdrv_finalize_vmstate_co_entry(void *opaque)
>>   {
>>       return bdrv_co_finalize_vmstate(opaque);
>> diff --git a/include/block/block_int.h b/include/block/block_int.h
>> index f90f0e8b6a..0942578a74 100644
>> --- a/include/block/block_int.h
>> +++ b/include/block/block_int.h
>> @@ -788,6 +788,7 @@ struct BdrvChild {
>>       typedef struct BdrvSaveVMState BdrvSaveVMState;
>> +typedef struct BdrvLoadVMState BdrvLoadVMState;
>>     /*
>>    * Note: the function bdrv_append() copies and swaps contents of
>> @@ -955,6 +956,8 @@ struct BlockDriverState {
>>       /* Intermediate buffer for VM state saving from snapshot creation code */
>>       BdrvSaveVMState *savevm_state;
>> +    /* Intermediate buffer for VM state loading */
>> +    BdrvLoadVMState *loadvm_state;
>>   };
>>     struct BlockBackendRootState {
>>
>
> Also, finalize needs to be called from blk_load_vmstate(), like it's
> done for blk_save_vmstate().
>
good catch!
Vladimir Sementsov-Ogievskiy June 22, 2020, 6:13 p.m. UTC | #3
22.06.2020 20:17, Denis V. Lunev wrote:
> On 6/22/20 8:02 PM, Vladimir Sementsov-Ogievskiy wrote:
>> 22.06.2020 11:33, Denis V. Lunev wrote:
>>> This patch creates an intermediate buffer for reading from the block driver
>>> state and performs read-ahead into this buffer. Snapshot code performs
>>> reads sequentially, so we know in advance which offsets will be required
>>> and when they are no longer needed.
>>>
>>> Results are fantastic. Times to switch to a snapshot of a 2 GB Fedora 31 VM
>>> on NVMe storage are the following:
>>>                   original     fixed
>>> cached:          1.84s       1.16s
>>> non-cached:     12.74s       1.27s
>>>
>>> The difference over HDD would be more significant :)
>>>
>>> Signed-off-by: Denis V. Lunev <den@openvz.org>
>>> CC: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
>>> CC: Kevin Wolf <kwolf@redhat.com>
>>> CC: Max Reitz <mreitz@redhat.com>
>>> CC: Stefan Hajnoczi <stefanha@redhat.com>
>>> CC: Fam Zheng <fam@euphon.net>
>>> CC: Juan Quintela <quintela@redhat.com>
>>> CC: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
>>> CC: Denis Plotnikov <dplotnikov@virtuozzo.com>
>>> ---
>>>
>>>    block/io.c                | 225 +++++++++++++++++++++++++++++++++++++-
>>>    include/block/block_int.h |   3 +
>>>    2 files changed, 225 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/block/io.c b/block/io.c
>>> index 71a696deb7..bb06f750d8 100644
>>> --- a/block/io.c
>>> +++ b/block/io.c
>>> @@ -2739,6 +2739,180 @@ static int bdrv_co_do_save_vmstate(BlockDriverState *bs, QEMUIOVector *qiov,
>>>        }
>>>    }
>>>    +
>>> +typedef struct BdrvLoadVMChunk {
>>> +    void *buf;
>>> +    uint64_t offset;
>>> +    ssize_t bytes;
>>> +
>>> +    QLIST_ENTRY(BdrvLoadVMChunk) list;
>>> +} BdrvLoadVMChunk;
>>> +
>>> +typedef struct BdrvLoadVMState {
>>> +    AioTaskPool *pool;
>>> +
>>> +    int64_t offset;
>>> +    int64_t last_loaded;
>>> +
>>> +    int chunk_count;
>>> +    QLIST_HEAD(, BdrvLoadVMChunk) chunks;
>>> +    QLIST_HEAD(, BdrvLoadVMChunk) loading;
>>> +    CoMutex lock;
>>> +    CoQueue waiters;
>>> +} BdrvLoadVMState;
>>> +
>>> +typedef struct BdrvLoadVMStateTask {
>>> +    AioTask task;
>>> +
>>> +    BlockDriverState *bs;
>>> +    BdrvLoadVMChunk *chunk;
>>> +} BdrvLoadVMStateTask;
>>> +
>>> +static BdrvLoadVMChunk *bdrv_co_find_loadvmstate_chunk(int64_t pos,
>>> +                                                       BdrvLoadVMChunk *c)
>>> +{
>>> +    for (; c != NULL; c = QLIST_NEXT(c, list)) {
>>> +        if (c->offset <= pos && c->offset + c->bytes > pos) {
>>> +            return c;
>>> +        }
>>> +    }
>>> +
>>> +    return NULL;
>>> +}
>>> +
>>> +static void bdrv_free_loadvm_chunk(BdrvLoadVMChunk *c)
>>> +{
>>> +    qemu_vfree(c->buf);
>>> +    g_free(c);
>>> +}
>>> +
>>> +static coroutine_fn int bdrv_co_vmstate_load_task_entry(AioTask *task)
>>> +{
>>> +    int err = 0;
>>> +    BdrvLoadVMStateTask *t = container_of(task, BdrvLoadVMStateTask, task);
>>> +    BdrvLoadVMChunk *c = t->chunk;
>>> +    BdrvLoadVMState *state = t->bs->loadvm_state;
>>> +    QEMUIOVector qiov = QEMU_IOVEC_INIT_BUF(qiov, c->buf, c->bytes);
>>> +
>>> +    bdrv_inc_in_flight(t->bs);
>>> +    err = t->bs->drv->bdrv_load_vmstate(t->bs, &qiov, c->offset);
>>> +    bdrv_dec_in_flight(t->bs);
>>> +
>>> +    qemu_co_mutex_lock(&state->lock);
>>> +    QLIST_REMOVE(c, list);
>>> +    if (err == 0) {
>>> +        QLIST_INSERT_HEAD(&state->chunks, c, list);
>>> +    } else {
>>> +        bdrv_free_loadvm_chunk(c);
>>> +    }
>>> +    qemu_co_mutex_unlock(&state->lock);
>>> +    qemu_co_queue_restart_all(&state->waiters);
>>> +
>>> +    return err;
>>> +}
>>> +
>>> +static void bdrv_co_start_loadvmstate(BlockDriverState *bs,
>>> +                                      BdrvLoadVMState *state)
>>> +{
>>> +    int i;
>>> +    size_t buf_size = MAX(bdrv_get_cluster_size(bs), 1 * MiB);
>>> +
>>> +    qemu_co_mutex_assert_locked(&state->lock);
>>> +    for (i = state->chunk_count; i < BDRV_VMSTATE_WORKERS_MAX; i++) {
>>> +        BdrvLoadVMStateTask *t = g_new(BdrvLoadVMStateTask, 1);
>>> +
>>> +        *t = (BdrvLoadVMStateTask) {
>>> +            .task.func = bdrv_co_vmstate_load_task_entry,
>>> +            .bs = bs,
>>> +            .chunk = g_new(BdrvLoadVMChunk, 1),
>>> +        };
>>> +
>>> +        *t->chunk = (BdrvLoadVMChunk) {
>>> +            .buf = qemu_blockalign(bs, buf_size),
>>> +            .offset = state->last_loaded,
>>> +            .bytes = buf_size,
>>> +        };
>>> +        /* FIXME: tail of stream */
>>> +
>>> +        QLIST_INSERT_HEAD(&state->loading, t->chunk, list);
>>> +        state->chunk_count++;
>>> +        state->last_loaded += buf_size;
>>> +
>>> +        qemu_co_mutex_unlock(&state->lock);
>>> +        aio_task_pool_start_task(state->pool, &t->task);
>>
>> Hmm, so, we unlock the mutex here. Am I missing something, or if there
>> are two load-state requests running in parallel (yes, this shouldn't
>> happen with the current migration code, but if we didn't care about
>> that at all, the mutex would not be needed),
>>
>> we may have two such loops in parallel, which will create more than
>> BDRV_VMSTATE_WORKERS_MAX chunks, as each has its own local "i" variable?
> 
> We could (technically) stall waiting for a free slot in the pool
> while the pool completion needs to take the lock. This should
> NOT happen with the current architecture, but it would be better
> to do things correctly, i.e. drop the lock.
> 
>>
>>> +        qemu_co_mutex_lock(&state->lock);
>>> +    }
>>> +}
>>> +
>>> +static int bdrv_co_do_load_vmstate(BlockDriverState *bs, QEMUIOVector *qiov,
>>> +                                   int64_t pos)
>>> +{
>>> +    BdrvLoadVMState *state = bs->loadvm_state;
>>> +    BdrvLoadVMChunk *c;
>>> +    size_t off;
>>> +
>>> +    if (state == NULL) {
>>> +        if (pos != 0) {
>>> +            /* Normally this branch is not reachable from migration */
>>> +            return bs->drv->bdrv_load_vmstate(bs, qiov, pos);
>>> +        }
>>> +
>>> +        state = g_new(BdrvLoadVMState, 1);
>>> +        *state = (BdrvLoadVMState) {
>>> +            .pool = aio_task_pool_new(BDRV_VMSTATE_WORKERS_MAX),
>>> +            .chunks = QLIST_HEAD_INITIALIZER(state->chunks),
>>> +            .loading = QLIST_HEAD_INITIALIZER(state->loading),
>>> +        };
>>> +        qemu_co_mutex_init(&state->lock);
>>> +        qemu_co_queue_init(&state->waiters);
>>> +
>>> +        bs->loadvm_state = state;
>>> +    }
>>> +
>>> +    if (state->offset != pos) {
>>> +        /* Normally this branch is not reachable from migration */
>>> +        return bs->drv->bdrv_load_vmstate(bs, qiov, pos);
>>> +    }
>>> +
>>> +    off = 0;
>>> +    qemu_co_mutex_lock(&state->lock);
>>> +    bdrv_co_start_loadvmstate(bs, state);
>>> +
>>> +    while (off < qiov->size && aio_task_pool_status(state->pool) == 0) {
>>> +        c = bdrv_co_find_loadvmstate_chunk(pos, QLIST_FIRST(&state->chunks));
>>> +        if (c != NULL) {
>>> +            ssize_t chunk_off = pos - c->offset;
>>> +            ssize_t to_copy = MIN(qiov->size - off, c->bytes - chunk_off);
>>> +
>>> +            qemu_iovec_from_buf(qiov, off, c->buf + chunk_off, to_copy);
>>> +
>>> +            off += to_copy;
>>> +            pos += to_copy;
>>> +
>>> +            if (pos == c->offset + c->bytes) {
>>> +                state->chunk_count--;
>>> +                /* End of buffer, discard it from the list */
>>> +                QLIST_REMOVE(c, list);
>>> +                bdrv_free_loadvm_chunk(c);
>>> +            }
>>> +
>>> +            state->offset += to_copy;
>>> +            continue;
>>> +        }
>>> +
>>> +        c = bdrv_co_find_loadvmstate_chunk(pos, QLIST_FIRST(&state->loading));
>>> +        if (c != NULL) {
>>> +            qemu_co_queue_wait(&state->waiters, &state->lock);
>>> +            continue;
>>> +        }
>>> +
>>> +        bdrv_co_start_loadvmstate(bs, state);
>>
>> Hmm, so, you assume that if we are in the loop, we'll definitely find
>> our chunk at some point. That should work with a normal migration
>> stream, which reads sequentially chunk by chunk. But if not, it may hang:
>>
>> Consider two reads at the same position, running in parallel: they both
>> pass the "state->offset != pos" check and wait on the mutex. Then the
>> first one enters the critical section, finds and removes the
>> corresponding chunk. Then the second enters the critical section and
>> loops forever, because the corresponding chunk will never be generated
>> again.
>>
>> So it seems we should check (state->offset == pos) on each iteration,
>> or something like this.
>>
> We can. But the reader is synchronous and cannot issue two requests
> in parallel, so this is not a problem.
> 

I think it's OK to rely on the current one-synchronous-reader architecture, but some
related comments/assertions wouldn't hurt.
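
E.g. a sketch of such a comment at the top of bdrv_co_do_load_vmstate():

    /*
     * The migration code is the only expected caller, and it reads the
     * vmstate sequentially from a single coroutine, so at most one
     * request is inside this function at any time.  Out-of-order reads
     * fall back to the driver directly and are not accelerated.
     */
    assert(qemu_in_coroutine());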

Patch

diff --git a/block/io.c b/block/io.c
index 71a696deb7..bb06f750d8 100644
--- a/block/io.c
+++ b/block/io.c
@@ -2739,6 +2739,180 @@  static int bdrv_co_do_save_vmstate(BlockDriverState *bs, QEMUIOVector *qiov,
     }
 }
 
+
+typedef struct BdrvLoadVMChunk {
+    void *buf;
+    uint64_t offset;
+    ssize_t bytes;
+
+    QLIST_ENTRY(BdrvLoadVMChunk) list;
+} BdrvLoadVMChunk;
+
+typedef struct BdrvLoadVMState {
+    AioTaskPool *pool;
+
+    int64_t offset;
+    int64_t last_loaded;
+
+    int chunk_count;
+    QLIST_HEAD(, BdrvLoadVMChunk) chunks;
+    QLIST_HEAD(, BdrvLoadVMChunk) loading;
+    CoMutex lock;
+    CoQueue waiters;
+} BdrvLoadVMState;
+
+typedef struct BdrvLoadVMStateTask {
+    AioTask task;
+
+    BlockDriverState *bs;
+    BdrvLoadVMChunk *chunk;
+} BdrvLoadVMStateTask;
+
+static BdrvLoadVMChunk *bdrv_co_find_loadvmstate_chunk(int64_t pos,
+                                                       BdrvLoadVMChunk *c)
+{
+    for (; c != NULL; c = QLIST_NEXT(c, list)) {
+        if (c->offset <= pos && c->offset + c->bytes > pos) {
+            return c;
+        }
+    }
+
+    return NULL;
+}
+
+static void bdrv_free_loadvm_chunk(BdrvLoadVMChunk *c)
+{
+    qemu_vfree(c->buf);
+    g_free(c);
+}
+
+static coroutine_fn int bdrv_co_vmstate_load_task_entry(AioTask *task)
+{
+    int err = 0;
+    BdrvLoadVMStateTask *t = container_of(task, BdrvLoadVMStateTask, task);
+    BdrvLoadVMChunk *c = t->chunk;
+    BdrvLoadVMState *state = t->bs->loadvm_state;
+    QEMUIOVector qiov = QEMU_IOVEC_INIT_BUF(qiov, c->buf, c->bytes);
+
+    bdrv_inc_in_flight(t->bs);
+    err = t->bs->drv->bdrv_load_vmstate(t->bs, &qiov, c->offset);
+    bdrv_dec_in_flight(t->bs);
+
+    qemu_co_mutex_lock(&state->lock);
+    QLIST_REMOVE(c, list);
+    if (err == 0) {
+        QLIST_INSERT_HEAD(&state->chunks, c, list);
+    } else {
+        bdrv_free_loadvm_chunk(c);
+    }
+    qemu_co_mutex_unlock(&state->lock);
+    qemu_co_queue_restart_all(&state->waiters);
+
+    return err;
+}
+
+static void bdrv_co_start_loadvmstate(BlockDriverState *bs,
+                                      BdrvLoadVMState *state)
+{
+    int i;
+    size_t buf_size = MAX(bdrv_get_cluster_size(bs), 1 * MiB);
+
+    qemu_co_mutex_assert_locked(&state->lock);
+    for (i = state->chunk_count; i < BDRV_VMSTATE_WORKERS_MAX; i++) {
+        BdrvLoadVMStateTask *t = g_new(BdrvLoadVMStateTask, 1);
+
+        *t = (BdrvLoadVMStateTask) {
+            .task.func = bdrv_co_vmstate_load_task_entry,
+            .bs = bs,
+            .chunk = g_new(BdrvLoadVMChunk, 1),
+        };
+
+        *t->chunk = (BdrvLoadVMChunk) {
+            .buf = qemu_blockalign(bs, buf_size),
+            .offset = state->last_loaded,
+            .bytes = buf_size,
+        };
+        /* FIXME: tail of stream */
+
+        QLIST_INSERT_HEAD(&state->loading, t->chunk, list);
+        state->chunk_count++;
+        state->last_loaded += buf_size;
+
+        qemu_co_mutex_unlock(&state->lock);
+        aio_task_pool_start_task(state->pool, &t->task);
+        qemu_co_mutex_lock(&state->lock);
+    }
+}
+
+static int bdrv_co_do_load_vmstate(BlockDriverState *bs, QEMUIOVector *qiov,
+                                   int64_t pos)
+{
+    BdrvLoadVMState *state = bs->loadvm_state;
+    BdrvLoadVMChunk *c;
+    size_t off;
+
+    if (state == NULL) {
+        if (pos != 0) {
+            /* Normally this branch is not reachable from migration */
+            return bs->drv->bdrv_load_vmstate(bs, qiov, pos);
+        }
+
+        state = g_new(BdrvLoadVMState, 1);
+        *state = (BdrvLoadVMState) {
+            .pool = aio_task_pool_new(BDRV_VMSTATE_WORKERS_MAX),
+            .chunks = QLIST_HEAD_INITIALIZER(state->chunks),
+            .loading = QLIST_HEAD_INITIALIZER(state->loading),
+        };
+        qemu_co_mutex_init(&state->lock);
+        qemu_co_queue_init(&state->waiters);
+
+        bs->loadvm_state = state;
+    }
+
+    if (state->offset != pos) {
+        /* Normally this branch is not reachable from migration */
+        return bs->drv->bdrv_load_vmstate(bs, qiov, pos);
+    }
+
+    off = 0;
+    qemu_co_mutex_lock(&state->lock);
+    bdrv_co_start_loadvmstate(bs, state);
+
+    while (off < qiov->size && aio_task_pool_status(state->pool) == 0) {
+        c = bdrv_co_find_loadvmstate_chunk(pos, QLIST_FIRST(&state->chunks));
+        if (c != NULL) {
+            ssize_t chunk_off = pos - c->offset;
+            ssize_t to_copy = MIN(qiov->size - off, c->bytes - chunk_off);
+
+            qemu_iovec_from_buf(qiov, off, c->buf + chunk_off, to_copy);
+
+            off += to_copy;
+            pos += to_copy;
+
+            if (pos == c->offset + c->bytes) {
+                state->chunk_count--;
+                /* End of buffer, discard it from the list */
+                QLIST_REMOVE(c, list);
+                bdrv_free_loadvm_chunk(c);
+            }
+
+            state->offset += to_copy;
+            continue;
+        }
+
+        c = bdrv_co_find_loadvmstate_chunk(pos, QLIST_FIRST(&state->loading));
+        if (c != NULL) {
+            qemu_co_queue_wait(&state->waiters, &state->lock);
+            continue;
+        }
+
+        bdrv_co_start_loadvmstate(bs, state);
+    }
+    qemu_co_mutex_unlock(&state->lock);
+
+    return aio_task_pool_status(state->pool);
+}
+
 static int coroutine_fn
 bdrv_co_rw_vmstate(BlockDriverState *bs, QEMUIOVector *qiov, int64_t pos,
                    bool is_read)
@@ -2752,7 +2926,7 @@  bdrv_co_rw_vmstate(BlockDriverState *bs, QEMUIOVector *qiov, int64_t pos,
         ret = -ENOMEDIUM;
     } else if (drv->bdrv_load_vmstate) {
         if (is_read) {
-            ret = drv->bdrv_load_vmstate(bs, qiov, pos);
+            ret = bdrv_co_do_load_vmstate(bs, qiov, pos);
         } else {
             ret = bdrv_co_do_save_vmstate(bs, qiov, pos);
         }
@@ -2823,13 +2997,13 @@  int bdrv_readv_vmstate(BlockDriverState *bs, QEMUIOVector *qiov, int64_t pos)
     return bdrv_rw_vmstate(bs, qiov, pos, true);
 }
 
-static int coroutine_fn bdrv_co_finalize_vmstate(BlockDriverState *bs)
+static int coroutine_fn bdrv_co_finalize_save_vmstate(BlockDriverState *bs)
 {
     int err;
     BdrvSaveVMState *state = bs->savevm_state;
 
     if (bs->drv->bdrv_save_vmstate == NULL && bs->file != NULL) {
-        return bdrv_co_finalize_vmstate(bs->file->bs);
+        return bdrv_co_finalize_save_vmstate(bs->file->bs);
     }
     if (state == NULL) {
         return 0;
@@ -2851,6 +3025,51 @@  static int coroutine_fn bdrv_co_finalize_vmstate(BlockDriverState *bs)
     return err;
 }
 
+static int coroutine_fn bdrv_co_finalize_load_vmstate(BlockDriverState *bs)
+{
+    int err;
+    BdrvLoadVMState *state = bs->loadvm_state;
+    BdrvLoadVMChunk *c, *tmp;
+
+    if (bs->drv->bdrv_load_vmstate == NULL && bs->file != NULL) {
+        return bdrv_co_finalize_load_vmstate(bs->file->bs);
+    }
+    if (state == NULL) {
+        return 0;
+    }
+
+    aio_task_pool_wait_all(state->pool);
+    err = aio_task_pool_status(state->pool);
+    aio_task_pool_free(state->pool);
+
+    QLIST_FOREACH(c, &state->loading, list) {
+        assert(1);  /* this list must be empty as all tasks are committed */
+    }
+    QLIST_FOREACH_SAFE(c, &state->chunks, list, tmp) {
+        QLIST_REMOVE(c, list);
+        bdrv_free_loadvm_chunk(c);
+    }
+
+    g_free(state);
+
+    bs->loadvm_state = NULL;
+
+    return err;
+}
+
+static int coroutine_fn bdrv_co_finalize_vmstate(BlockDriverState *bs)
+{
+    int err1 = bdrv_co_finalize_save_vmstate(bs);
+    int err2 = bdrv_co_finalize_load_vmstate(bs);
+    if (err1 < 0) {
+        return err1;
+    }
+    if (err2 < 0) {
+        return err2;
+    }
+    return 0;
+}
+
 static int coroutine_fn bdrv_finalize_vmstate_co_entry(void *opaque)
 {
     return bdrv_co_finalize_vmstate(opaque);
diff --git a/include/block/block_int.h b/include/block/block_int.h
index f90f0e8b6a..0942578a74 100644
--- a/include/block/block_int.h
+++ b/include/block/block_int.h
@@ -788,6 +788,7 @@  struct BdrvChild {
 
 
 typedef struct BdrvSaveVMState BdrvSaveVMState;
+typedef struct BdrvLoadVMState BdrvLoadVMState;
 
 /*
  * Note: the function bdrv_append() copies and swaps contents of
@@ -955,6 +956,8 @@  struct BlockDriverState {
 
     /* Intermediate buffer for VM state saving from snapshot creation code */
     BdrvSaveVMState *savevm_state;
+    /* Intermediate buffer for VM state loading */
+    BdrvLoadVMState *loadvm_state;
 };
 
 struct BlockBackendRootState {