diff mbox

[3/3] replay: introduce block devices record/replay

Message ID 20160209055524.8208.16023.stgit@PASHA-ISP (mailing list archive)
State New, archived
Headers show

Commit Message

Pavel Dovgalyuk Feb. 9, 2016, 5:55 a.m. UTC
This patch introduces a set of functions that implement recording
and replaying of block devices' operations. These functions form a thin
layer between blk_aio_ functions and replay subsystem.
All asynchronous block requests are added to the queue to be processed
at deterministically invoked record/replay checkpoints.
Queue is flushed at checkpoints and information about processed requests
is recorded to the log. In replay phase the queue is matched with
events read from the log. Therefore block devices requests are processed
deterministically.

Signed-off-by: Pavel Dovgalyuk <pavel.dovgaluk@ispras.ru>
---
 block/block-backend.c          |   71 +++++++++++---
 include/sysemu/block-backend.h |   16 +++
 include/sysemu/replay.h        |   27 +++++
 replay/Makefile.objs           |    1 
 replay/replay-block.c          |  202 ++++++++++++++++++++++++++++++++++++++++
 replay/replay-events.c         |   51 ++++++++++
 replay/replay-internal.h       |   24 +++++
 stubs/replay.c                 |   46 +++++++++
 8 files changed, 423 insertions(+), 15 deletions(-)
 create mode 100755 replay/replay-block.c

Comments

Kevin Wolf Feb. 9, 2016, 10:27 a.m. UTC | #1
Am 09.02.2016 um 06:55 hat Pavel Dovgalyuk geschrieben:
> This patch introduces a set of functions that implement recording
> and replaying of block devices' operations. These functions form a thin
> layer between blk_aio_ functions and replay subsystem.
> All asynchronous block requests are added to the queue to be processed
> at deterministically invoked record/replay checkpoints.
> Queue is flushed at checkpoints and information about processed requests
> is recorded to the log. In replay phase the queue is matched with
> events read from the log. Therefore block devices requests are processed
> deterministically.
> 
> Signed-off-by: Pavel Dovgalyuk <pavel.dovgaluk@ispras.ru>

This series doesn't seem to apply to current master, so it's somewhat
hard to look at the end result. I can see just from patches, though,
that this will need some more discussion.

Just picking one example of how you convert blk_* functions:

> -BlockAIOCB *blk_aio_write_zeroes(BlockBackend *blk, int64_t sector_num,
> -                                 int nb_sectors, BdrvRequestFlags flags,
> -                                 BlockCompletionFunc *cb, void *opaque)
> +BlockAIOCB *blk_aio_write_zeroes_impl(BlockBackend *blk, int64_t sector_num,
> +                                      int nb_sectors, BdrvRequestFlags flags,
> +                                      BlockCompletionFunc *cb, void *opaque)
>  {
>      int ret = blk_check_request(blk, sector_num, nb_sectors);
>      if (ret < 0) {
> @@ -673,6 +674,13 @@ BlockAIOCB *blk_aio_write_zeroes(BlockBackend *blk, int64_t sector_num,
>                                   cb, opaque);
>  }
>  
> +BlockAIOCB *blk_aio_write_zeroes(BlockBackend *blk, int64_t sector_num,
> +                                 int nb_sectors, BdrvRequestFlags flags,
> +                                 BlockCompletionFunc *cb, void *opaque)
> +{
> +    return replay_aio_write_zeroes(blk, sector_num, nb_sectors, flags, cb, opaque);
> +}
> +

> +BlockAIOCB *replay_aio_write_zeroes(BlockBackend *blk, int64_t sector_num,
> +                                    int nb_sectors, BdrvRequestFlags flags,
> +                                    BlockCompletionFunc *cb, void *opaque)
> +{
> +    if (replay_mode == REPLAY_MODE_NONE) {
> +        return blk_aio_write_zeroes_impl(blk, sector_num, nb_sectors, flags, cb, opaque);
> +    } else {
> +        ReplayAIOCB *acb = replay_aio_create(REPLAY_ASYNC_EVENT_BLOCK_WRITE_ZEROES,
> +                                             blk, cb, opaque);
> +        acb->req.sector = sector_num;
> +        acb->req.nb_sectors = nb_sectors;
> +        acb->req.flags = flags;
> +        replay_add_event(REPLAY_ASYNC_EVENT_BLOCK_WRITE_ZEROES, acb, NULL, 0);
> +
> +        return &acb->common;
> +    }
> +}

I think it's obvious that adding two functions to the call chain which
do nothing in the common case is a bit ugly. If we did this for every
feature that could possibly be enabled, we'd end up with two-kilometer
stack traces.

So definitely don't call into replay.c, which just calls back in 99.9%
of the cases, but if anything, do the check in block-backends.c.


But even this doesn't feel completely right, because block drivers are
already layered and there is no need to hardcode something optional (and
rarely used) in the hot code path that could just be another layer.

I assume that you know beforehand if you want to replay something, so
requiring you to configure your block devices with a replay driver on
top of the stack seems reasonable enough.

Kevin
Pavel Dovgalyuk Feb. 9, 2016, 11:52 a.m. UTC | #2
> From: Kevin Wolf [mailto:kwolf@redhat.com]
> Am 09.02.2016 um 06:55 hat Pavel Dovgalyuk geschrieben:
> > This patch introduces a set of functions that implement recording
> > and replaying of block devices' operations. These functions form a thin
> > layer between blk_aio_ functions and replay subsystem.
> > All asynchronous block requests are added to the queue to be processed
> > at deterministically invoked record/replay checkpoints.
> > Queue is flushed at checkpoints and information about processed requests
> > is recorded to the log. In replay phase the queue is matched with
> > events read from the log. Therefore block devices requests are processed
> > deterministically.
> >
> > Signed-off-by: Pavel Dovgalyuk <pavel.dovgaluk@ispras.ru>
> 
> This series doesn't seem to apply to current master, so it's somewhat
> hard to look at the end result. I can see just from patches, though,
> that this will need some more discussion.

Thank you for the response.
I forgot to rebase these patches, but there are minor problems with applying them.

> 
> Just picking one example of how you convert blk_* functions:
> 
> > -BlockAIOCB *blk_aio_write_zeroes(BlockBackend *blk, int64_t sector_num,
> > -                                 int nb_sectors, BdrvRequestFlags flags,
> > -                                 BlockCompletionFunc *cb, void *opaque)
> > +BlockAIOCB *blk_aio_write_zeroes_impl(BlockBackend *blk, int64_t sector_num,
> > +                                      int nb_sectors, BdrvRequestFlags flags,
> > +                                      BlockCompletionFunc *cb, void *opaque)
> >  {
> >      int ret = blk_check_request(blk, sector_num, nb_sectors);
> >      if (ret < 0) {
> > @@ -673,6 +674,13 @@ BlockAIOCB *blk_aio_write_zeroes(BlockBackend *blk, int64_t sector_num,
> >                                   cb, opaque);
> >  }
> >
> > +BlockAIOCB *blk_aio_write_zeroes(BlockBackend *blk, int64_t sector_num,
> > +                                 int nb_sectors, BdrvRequestFlags flags,
> > +                                 BlockCompletionFunc *cb, void *opaque)
> > +{
> > +    return replay_aio_write_zeroes(blk, sector_num, nb_sectors, flags, cb, opaque);
> > +}
> > +
> 
> > +BlockAIOCB *replay_aio_write_zeroes(BlockBackend *blk, int64_t sector_num,
> > +                                    int nb_sectors, BdrvRequestFlags flags,
> > +                                    BlockCompletionFunc *cb, void *opaque)
> > +{
> > +    if (replay_mode == REPLAY_MODE_NONE) {
> > +        return blk_aio_write_zeroes_impl(blk, sector_num, nb_sectors, flags, cb, opaque);
> > +    } else {
> > +        ReplayAIOCB *acb = replay_aio_create(REPLAY_ASYNC_EVENT_BLOCK_WRITE_ZEROES,
> > +                                             blk, cb, opaque);
> > +        acb->req.sector = sector_num;
> > +        acb->req.nb_sectors = nb_sectors;
> > +        acb->req.flags = flags;
> > +        replay_add_event(REPLAY_ASYNC_EVENT_BLOCK_WRITE_ZEROES, acb, NULL, 0);
> > +
> > +        return &acb->common;
> > +    }
> > +}
> 
> I think it's obvious that adding two functions to the call chain which
> do nothing in the common case is a bit ugly. If we did this for every
> feature that could possibly be enabled, we'd end up with two-kilometer
> stack traces.
> 
> So definitely don't call into replay.c, which just calls back in 99.9%
> of the cases, but if anything, do the check in block-backends.c.

This way seems to be better.

> But even this doesn't feel completely right, because block drivers are
> already layered and there is no need to hardcode something optional (and
> rarely used) in the hot code path that could just be another layer.
> 
> I assume that you know beforehand if you want to replay something, so
> requiring you to configure your block devices with a replay driver on
> top of the stack seems reasonable enough.

I cannot use block drivers for this. When driver functions are used, QEMU
is already used coroutines (and probably started bottom halves).
Coroutines make execution non-deterministic.
That's why we have to intercept blk_aio_ functions, that are called
deterministically.

Pavel Dovgalyuk
Kevin Wolf Feb. 10, 2016, 11:45 a.m. UTC | #3
Am 09.02.2016 um 12:52 hat Pavel Dovgalyuk geschrieben:
> > From: Kevin Wolf [mailto:kwolf@redhat.com]
> > But even this doesn't feel completely right, because block drivers are
> > already layered and there is no need to hardcode something optional (and
> > rarely used) in the hot code path that could just be another layer.
> > 
> > I assume that you know beforehand if you want to replay something, so
> > requiring you to configure your block devices with a replay driver on
> > top of the stack seems reasonable enough.
> 
> I cannot use block drivers for this. When driver functions are used, QEMU
> is already used coroutines (and probably started bottom halves).
> Coroutines make execution non-deterministic.
> That's why we have to intercept blk_aio_ functions, that are called
> deterministically.

What does "deterministic" mean in this context, i.e. what are your exact
requirements?

I don't think that coroutines introduce anything non-deterministic per
se. Depending on what you mean by it, the block layer code paths in
block.c may contain problematic code.

The block layer uses bottom halves in some cases for request completion,
but not until calling into the first driver (why would they be a
problem?). What could happen is that a request is serialised and
therefore delayed in some special configurations, which sounds a lot
like what you wanted to avoid.

Is there a writeup somewhere of the context in which this code is called
and the rules that need to be considered for replay to work?

Kevin
Pavel Dovgalyuk Feb. 10, 2016, 12:05 p.m. UTC | #4
> -----Original Message-----
> Am 09.02.2016 um 12:52 hat Pavel Dovgalyuk geschrieben:
> > > From: Kevin Wolf [mailto:kwolf@redhat.com]
> > > But even this doesn't feel completely right, because block drivers are
> > > already layered and there is no need to hardcode something optional (and
> > > rarely used) in the hot code path that could just be another layer.
> > >
> > > I assume that you know beforehand if you want to replay something, so
> > > requiring you to configure your block devices with a replay driver on
> > > top of the stack seems reasonable enough.
> >
> > I cannot use block drivers for this. When driver functions are used, QEMU
> > is already used coroutines (and probably started bottom halves).
> > Coroutines make execution non-deterministic.
> > That's why we have to intercept blk_aio_ functions, that are called
> > deterministically.
> 
> What does "deterministic" mean in this context, i.e. what are your exact
> requirements?

"Deterministic" means that the replayed execution should run exactly
the same guest instructions in the same sequence, as in recording session.

> I don't think that coroutines introduce anything non-deterministic per
> se. Depending on what you mean by it, the block layer code paths in
> block.c may contain problematic code.

They are non-deterministic if we need instruction-level accuracy.
Thread switching (and therefore callbacks and BH execution) is non-deterministic.
In two different executions these callbacks may happen at different moments of
time (counting in number of executed instructions).
All operations with virtual devices (including memory, interrupt controller,
and disk drive controller) should happen at deterministic moments of time
to be replayable.

> The block layer uses bottom halves in some cases for request completion,
> but not until calling into the first driver (why would they be a
> problem?). What could happen is that a request is serialised and
> therefore delayed in some special configurations, which sounds a lot
> like what you wanted to avoid.

Drivers cannot distinguish the requests from guest CPU and from savevm/loadvm.
First ones have to be deterministic, because they affect guest memory,
virtual disk controller, and interrupts.

> Is there a writeup somewhere of the context in which this code is called
> and the rules that need to be considered for replay to work?

There is some information about it in docs/replay.txt and in
http://wiki.qemu.org/Features/record-replay

Pavel Dovgalyuk
Kevin Wolf Feb. 10, 2016, 12:28 p.m. UTC | #5
Am 10.02.2016 um 13:05 hat Pavel Dovgalyuk geschrieben:
> > -----Original Message-----
> > Am 09.02.2016 um 12:52 hat Pavel Dovgalyuk geschrieben:
> > > > From: Kevin Wolf [mailto:kwolf@redhat.com]
> > > > But even this doesn't feel completely right, because block drivers are
> > > > already layered and there is no need to hardcode something optional (and
> > > > rarely used) in the hot code path that could just be another layer.
> > > >
> > > > I assume that you know beforehand if you want to replay something, so
> > > > requiring you to configure your block devices with a replay driver on
> > > > top of the stack seems reasonable enough.
> > >
> > > I cannot use block drivers for this. When driver functions are used, QEMU
> > > is already used coroutines (and probably started bottom halves).
> > > Coroutines make execution non-deterministic.
> > > That's why we have to intercept blk_aio_ functions, that are called
> > > deterministically.
> > 
> > What does "deterministic" mean in this context, i.e. what are your exact
> > requirements?
> 
> "Deterministic" means that the replayed execution should run exactly
> the same guest instructions in the same sequence, as in recording session.

Okay. I think with this we can do better than what you have now.

> > I don't think that coroutines introduce anything non-deterministic per
> > se. Depending on what you mean by it, the block layer code paths in
> > block.c may contain problematic code.
> 
> They are non-deterministic if we need instruction-level accuracy.
> Thread switching (and therefore callbacks and BH execution) is non-deterministic.

Thread switching depends on an external event (the kernel scheduler
deciding to switch), so agreed, if a thread switch ever influences what
the guest sees, that would be a problem.

Generally, however, callbacks and BHs don't involve a thread switch at
all (BHs can be invoked from a different thread in theory, but we have
very few of those cases and they shouldn't be visible for the guest).
The same is true for coroutines, which are semantically equivalent to
callbacks.

> In two different executions these callbacks may happen at different moments of
> time (counting in number of executed instructions).
> All operations with virtual devices (including memory, interrupt controller,
> and disk drive controller) should happen at deterministic moments of time
> to be replayable.

Right, so let's talk about what this external non-deterministic event
really is.

I think the only thing whose timing is unknown in the block layer is the
completion of I/O requests. This non-determinism comes from the time the
I/O syscalls made by the lowest layer (usually raw-posix) take.

This means that we can add logic to remove the non-determinism at the
point of our choice between raw-posix and the guest device emulation. A
block driver on top is as good as anything else.

While recording, this block driver would just pass the request to next
lower layer (starting a request is deterministic, so it doesn't need to
be logged) and once the request completes it logs it. While replaying,
the completion of requests is delayed until we read it in the log; if we
read it in the log and the request hasn't completed yet, we do a busy
wait for it (while(!completed) aio_poll();).

This model would get rid of the bdrv_drain_all() that you call
everywhere and therefore allow concurrent requests, giving a result that
is much closer to the "normal" behaviour without replay.

> > The block layer uses bottom halves in some cases for request completion,
> > but not until calling into the first driver (why would they be a
> > problem?). What could happen is that a request is serialised and
> > therefore delayed in some special configurations, which sounds a lot
> > like what you wanted to avoid.
> 
> Drivers cannot distinguish the requests from guest CPU and from savevm/loadvm.
> First ones have to be deterministic, because they affect guest memory,
> virtual disk controller, and interrupts.

Sure they can, these are two different callbacks. But even if they
couldn't, making more things than necessary deterministic might be
wasteful, but not really harmful.

> > Is there a writeup somewhere of the context in which this code is called
> > and the rules that need to be considered for replay to work?
> 
> There is some information about it in docs/replay.txt and in
> http://wiki.qemu.org/Features/record-replay

Thanks, that was useful. Especially the explanation of non-deterministic
events in docs/replay.txt seems to be very much in line with what I'm
suggesting, as it made clear that external events are what we should be
looking for.

Kevin
Pavel Dovgalyuk Feb. 10, 2016, 12:51 p.m. UTC | #6
> From: Kevin Wolf [mailto:kwolf@redhat.com]
> Am 10.02.2016 um 13:05 hat Pavel Dovgalyuk geschrieben:
> > > Am 09.02.2016 um 12:52 hat Pavel Dovgalyuk geschrieben:
> > > > > From: Kevin Wolf [mailto:kwolf@redhat.com]
> > > > > But even this doesn't feel completely right, because block drivers are
> > > > > already layered and there is no need to hardcode something optional (and
> > > > > rarely used) in the hot code path that could just be another layer.
> > > > >
> > > > > I assume that you know beforehand if you want to replay something, so
> > > > > requiring you to configure your block devices with a replay driver on
> > > > > top of the stack seems reasonable enough.
> > > >
> > > > I cannot use block drivers for this. When driver functions are used, QEMU
> > > > is already used coroutines (and probably started bottom halves).
> > > > Coroutines make execution non-deterministic.
> > > > That's why we have to intercept blk_aio_ functions, that are called
> > > > deterministically.
> > >
> > > What does "deterministic" mean in this context, i.e. what are your exact
> > > requirements?
> >
> > "Deterministic" means that the replayed execution should run exactly
> > the same guest instructions in the same sequence, as in recording session.
> 
> Okay. I think with this we can do better than what you have now.
> 
> > > I don't think that coroutines introduce anything non-deterministic per
> > > se. Depending on what you mean by it, the block layer code paths in
> > > block.c may contain problematic code.
> >
> > They are non-deterministic if we need instruction-level accuracy.
> > Thread switching (and therefore callbacks and BH execution) is non-deterministic.
> 
> Thread switching depends on an external event (the kernel scheduler
> deciding to switch), so agreed, if a thread switch ever influences what
> the guest sees, that would be a problem.
> 
> Generally, however, callbacks and BHs don't involve a thread switch at
> all (BHs can be invoked from a different thread in theory, but we have
> very few of those cases and they shouldn't be visible for the guest).
> The same is true for coroutines, which are semantically equivalent to
> callbacks.
> 
> > In two different executions these callbacks may happen at different moments of
> > time (counting in number of executed instructions).
> > All operations with virtual devices (including memory, interrupt controller,
> > and disk drive controller) should happen at deterministic moments of time
> > to be replayable.
> 
> Right, so let's talk about what this external non-deterministic event
> really is.
> 
> I think the only thing whose timing is unknown in the block layer is the
> completion of I/O requests. This non-determinism comes from the time the
> I/O syscalls made by the lowest layer (usually raw-posix) take.

Right.

> This means that we can add logic to remove the non-determinism at the
> point of our choice between raw-posix and the guest device emulation. A
> block driver on top is as good as anything else.
> 
> While recording, this block driver would just pass the request to next
> lower layer (starting a request is deterministic, so it doesn't need to
> be logged) and once the request completes it logs it. While replaying,
> the completion of requests is delayed until we read it in the log; if we
> read it in the log and the request hasn't completed yet, we do a busy
> wait for it (while(!completed) aio_poll();).

I tried serializing all bottom halves and worker thread callbacks in
previous version of the patches. That code was much more complicated 
and error-prone than the current version. We had to classify all bottom
halves to recorded and non-recorded (because sometimes they are used
for qemu's purposes, not the guest ones).

However, I don't understand yet which layer do you offer as the candidate
for record/replay? What functions should be changed?
I would like to investigate this way, but I don't got it yet.

> This model would get rid of the bdrv_drain_all() that you call
> everywhere and therefore allow concurrent requests, giving a result that
> is much closer to the "normal" behaviour without replay.
> 
> > > The block layer uses bottom halves in some cases for request completion,
> > > but not until calling into the first driver (why would they be a
> > > problem?). What could happen is that a request is serialised and
> > > therefore delayed in some special configurations, which sounds a lot
> > > like what you wanted to avoid.
> >
> > Drivers cannot distinguish the requests from guest CPU and from savevm/loadvm.
> > First ones have to be deterministic, because they affect guest memory,
> > virtual disk controller, and interrupts.
> 
> Sure they can, these are two different callbacks. But even if they
> couldn't, making more things than necessary deterministic might be
> wasteful, but not really harmful.

Is there any universal way to check this?

Pavel Dovgalyuk
Kevin Wolf Feb. 10, 2016, 1:25 p.m. UTC | #7
Am 10.02.2016 um 13:51 hat Pavel Dovgalyuk geschrieben:
> > From: Kevin Wolf [mailto:kwolf@redhat.com]
> > Am 10.02.2016 um 13:05 hat Pavel Dovgalyuk geschrieben:
> > > > Am 09.02.2016 um 12:52 hat Pavel Dovgalyuk geschrieben:
> > > > > > From: Kevin Wolf [mailto:kwolf@redhat.com]
> > > > > > But even this doesn't feel completely right, because block drivers are
> > > > > > already layered and there is no need to hardcode something optional (and
> > > > > > rarely used) in the hot code path that could just be another layer.
> > > > > >
> > > > > > I assume that you know beforehand if you want to replay something, so
> > > > > > requiring you to configure your block devices with a replay driver on
> > > > > > top of the stack seems reasonable enough.
> > > > >
> > > > > I cannot use block drivers for this. When driver functions are used, QEMU
> > > > > is already used coroutines (and probably started bottom halves).
> > > > > Coroutines make execution non-deterministic.
> > > > > That's why we have to intercept blk_aio_ functions, that are called
> > > > > deterministically.
> > > >
> > > > What does "deterministic" mean in this context, i.e. what are your exact
> > > > requirements?
> > >
> > > "Deterministic" means that the replayed execution should run exactly
> > > the same guest instructions in the same sequence, as in recording session.
> > 
> > Okay. I think with this we can do better than what you have now.
> > 
> > > > I don't think that coroutines introduce anything non-deterministic per
> > > > se. Depending on what you mean by it, the block layer code paths in
> > > > block.c may contain problematic code.
> > >
> > > They are non-deterministic if we need instruction-level accuracy.
> > > Thread switching (and therefore callbacks and BH execution) is non-deterministic.
> > 
> > Thread switching depends on an external event (the kernel scheduler
> > deciding to switch), so agreed, if a thread switch ever influences what
> > the guest sees, that would be a problem.
> > 
> > Generally, however, callbacks and BHs don't involve a thread switch at
> > all (BHs can be invoked from a different thread in theory, but we have
> > very few of those cases and they shouldn't be visible for the guest).
> > The same is true for coroutines, which are semantically equivalent to
> > callbacks.
> > 
> > > In two different executions these callbacks may happen at different moments of
> > > time (counting in number of executed instructions).
> > > All operations with virtual devices (including memory, interrupt controller,
> > > and disk drive controller) should happen at deterministic moments of time
> > > to be replayable.
> > 
> > Right, so let's talk about what this external non-deterministic event
> > really is.
> > 
> > I think the only thing whose timing is unknown in the block layer is the
> > completion of I/O requests. This non-determinism comes from the time the
> > I/O syscalls made by the lowest layer (usually raw-posix) take.
> 
> Right.
> 
> > This means that we can add logic to remove the non-determinism at the
> > point of our choice between raw-posix and the guest device emulation. A
> > block driver on top is as good as anything else.
> > 
> > While recording, this block driver would just pass the request to next
> > lower layer (starting a request is deterministic, so it doesn't need to
> > be logged) and once the request completes it logs it. While replaying,
> > the completion of requests is delayed until we read it in the log; if we
> > read it in the log and the request hasn't completed yet, we do a busy
> > wait for it (while(!completed) aio_poll();).
> 
> I tried serializing all bottom halves and worker thread callbacks in
> previous version of the patches. That code was much more complicated 
> and error-prone than the current version. We had to classify all bottom
> halves to recorded and non-recorded (because sometimes they are used
> for qemu's purposes, not the guest ones).
> 
> However, I don't understand yet which layer do you offer as the candidate
> for record/replay? What functions should be changed?
> I would like to investigate this way, but I don't got it yet.

At the core, I wouldn't change any existing function, but introduce a
new block driver. You could copy raw_bsd.c for a start and then tweak
it. Leave out functions that you don't want to support, and add the
necessary magic to .bdrv_co_readv/writev.

Something like this (can probably be generalised for more than just
reads as the part after the bdrv_co_reads() call should be the same for
reads, writes and any other request types):

int blkreplay_co_readv()
{
    BlockReplayState *s = bs->opaque;
    int reqid = s->reqid++;

    bdrv_co_readv(bs->file, ...);

    if (mode == record) {
        log(reqid, time);
    } else {
        assert(mode == replay);
        bool *done = req_replayed_list_get(reqid)
        if (done) {
            *done = true;
        } else {
            req_completed_list_insert(reqid, qemu_coroutine_self());
            qemu_coroutine_yield();
        }
    }
}

/* called by replay.c */
int blkreplay_run_event()
{
    if (mode == replay) {
        co = req_completed_list_get(e.reqid);
        if (co) {
            qemu_coroutine_enter(co);
        } else {
            bool done = false;
            req_replayed_list_insert(reqid, &done);
            /* wait synchronously for completion */
            while (!done) {
                aio_poll();
            }
        }
    }
}

Where we could consider changing existing code is that it might be
desirable to automatically put an instance of this block driver on top
of every block device when record/replay is used. If we don't do that,
you need to explicitly specify -drive driver=blkreplay,...

> > This model would get rid of the bdrv_drain_all() that you call
> > everywhere and therefore allow concurrent requests, giving a result that
> > is much closer to the "normal" behaviour without replay.
> > 
> > > > The block layer uses bottom halves in some cases for request completion,
> > > > but not until calling into the first driver (why would they be a
> > > > problem?). What could happen is that a request is serialised and
> > > > therefore delayed in some special configurations, which sounds a lot
> > > > like what you wanted to avoid.
> > >
> > > Drivers cannot distinguish the requests from guest CPU and from savevm/loadvm.
> > > First ones have to be deterministic, because they affect guest memory,
> > > virtual disk controller, and interrupts.
> > 
> > Sure they can, these are two different callbacks. But even if they
> > couldn't, making more things than necessary deterministic might be
> > wasteful, but not really harmful.
> 
> Is there any universal way to check this?

No, the block layer doesn't know its caller. It's only possible in this
specific case because the guest never calls .bdrv_save_vmstate().

But as I said, logging more requests than necessary doesn't really hurt
expect making the log a bit larger.

Kevin
Pavel Dovgalyuk Feb. 10, 2016, 1:33 p.m. UTC | #8
> From: Kevin Wolf [mailto:kwolf@redhat.com]
> Am 10.02.2016 um 13:51 hat Pavel Dovgalyuk geschrieben:
> > > From: Kevin Wolf [mailto:kwolf@redhat.com]
> > > Am 10.02.2016 um 13:05 hat Pavel Dovgalyuk geschrieben:
> > > > > Am 09.02.2016 um 12:52 hat Pavel Dovgalyuk geschrieben:
> > > > > > > From: Kevin Wolf [mailto:kwolf@redhat.com]
> > > > > > > But even this doesn't feel completely right, because block drivers are
> > > > > > > already layered and there is no need to hardcode something optional (and
> > > > > > > rarely used) in the hot code path that could just be another layer.
> > > > > > >
> > > > > > > I assume that you know beforehand if you want to replay something, so
> > > > > > > requiring you to configure your block devices with a replay driver on
> > > > > > > top of the stack seems reasonable enough.
> > > > > >
> > > > > > I cannot use block drivers for this. When driver functions are used, QEMU
> > > > > > is already used coroutines (and probably started bottom halves).
> > > > > > Coroutines make execution non-deterministic.
> > > > > > That's why we have to intercept blk_aio_ functions, that are called
> > > > > > deterministically.
> > > > >
> > > > > What does "deterministic" mean in this context, i.e. what are your exact
> > > > > requirements?
> > > >
> > > > "Deterministic" means that the replayed execution should run exactly
> > > > the same guest instructions in the same sequence, as in recording session.
> > >
> > > Okay. I think with this we can do better than what you have now.
> > >
> > > > > I don't think that coroutines introduce anything non-deterministic per
> > > > > se. Depending on what you mean by it, the block layer code paths in
> > > > > block.c may contain problematic code.
> > > >
> > > > They are non-deterministic if we need instruction-level accuracy.
> > > > Thread switching (and therefore callbacks and BH execution) is non-deterministic.
> > >
> > > Thread switching depends on an external event (the kernel scheduler
> > > deciding to switch), so agreed, if a thread switch ever influences what
> > > the guest sees, that would be a problem.
> > >
> > > Generally, however, callbacks and BHs don't involve a thread switch at
> > > all (BHs can be invoked from a different thread in theory, but we have
> > > very few of those cases and they shouldn't be visible for the guest).
> > > The same is true for coroutines, which are semantically equivalent to
> > > callbacks.
> > >
> > > > In two different executions these callbacks may happen at different moments of
> > > > time (counting in number of executed instructions).
> > > > All operations with virtual devices (including memory, interrupt controller,
> > > > and disk drive controller) should happen at deterministic moments of time
> > > > to be replayable.
> > >
> > > Right, so let's talk about what this external non-deterministic event
> > > really is.
> > >
> > > I think the only thing whose timing is unknown in the block layer is the
> > > completion of I/O requests. This non-determinism comes from the time the
> > > I/O syscalls made by the lowest layer (usually raw-posix) take.
> >
> > Right.
> >
> > > This means that we can add logic to remove the non-determinism at the
> > > point of our choice between raw-posix and the guest device emulation. A
> > > block driver on top is as good as anything else.
> > >
> > > While recording, this block driver would just pass the request to next
> > > lower layer (starting a request is deterministic, so it doesn't need to
> > > be logged) and once the request completes it logs it. While replaying,
> > > the completion of requests is delayed until we read it in the log; if we
> > > read it in the log and the request hasn't completed yet, we do a busy
> > > wait for it (while(!completed) aio_poll();).
> >
> > I tried serializing all bottom halves and worker thread callbacks in
> > previous version of the patches. That code was much more complicated
> > and error-prone than the current version. We had to classify all bottom
> > halves to recorded and non-recorded (because sometimes they are used
> > for qemu's purposes, not the guest ones).
> >
> > However, I don't understand yet which layer do you offer as the candidate
> > for record/replay? What functions should be changed?
> > I would like to investigate this way, but I don't got it yet.
> 
> At the core, I wouldn't change any existing function, but introduce a
> new block driver. You could copy raw_bsd.c for a start and then tweak
> it. Leave out functions that you don't want to support, and add the
> necessary magic to .bdrv_co_readv/writev.
> 
> Something like this (can probably be generalised for more than just
> reads as the part after the bdrv_co_reads() call should be the same for
> reads, writes and any other request types):
> 
> int blkreplay_co_readv()
> {
>     BlockReplayState *s = bs->opaque;
>     int reqid = s->reqid++;
> 
>     bdrv_co_readv(bs->file, ...);
> 
>     if (mode == record) {
>         log(reqid, time);
>     } else {
>         assert(mode == replay);
>         bool *done = req_replayed_list_get(reqid)
>         if (done) {
>             *done = true;
>         } else {
>             req_completed_list_insert(reqid, qemu_coroutine_self());
>             qemu_coroutine_yield();
>         }
>     }
> }
> 
> /* called by replay.c */
> int blkreplay_run_event()
> {
>     if (mode == replay) {
>         co = req_completed_list_get(e.reqid);
>         if (co) {
>             qemu_coroutine_enter(co);
>         } else {
>             bool done = false;
>             req_replayed_list_insert(reqid, &done);
>             /* wait synchronously for completion */
>             while (!done) {
>                 aio_poll();
>             }
>         }
>     }
> }
> 
> Where we could consider changing existing code is that it might be
> desirable to automatically put an instance of this block driver on top
> of every block device when record/replay is used. If we don't do that,
> you need to explicitly specify -drive driver=blkreplay,...

Thank you.

> 
> > > This model would get rid of the bdrv_drain_all() that you call
> > > everywhere and therefore allow concurrent requests, giving a result that
> > > is much closer to the "normal" behaviour without replay.
> > >
> > > > > The block layer uses bottom halves in some cases for request completion,
> > > > > but not until calling into the first driver (why would they be a
> > > > > problem?). What could happen is that a request is serialised and
> > > > > therefore delayed in some special configurations, which sounds a lot
> > > > > like what you wanted to avoid.
> > > >
> > > > Drivers cannot distinguish the requests from guest CPU and from savevm/loadvm.
> > > > First ones have to be deterministic, because they affect guest memory,
> > > > virtual disk controller, and interrupts.
> > >
> > > Sure they can, these are two different callbacks. But even if they
> > > couldn't, making more things than necessary deterministic might be
> > > wasteful, but not really harmful.
> >
> > Is there any universal way to check this?
> 
> No, the block layer doesn't know its caller. It's only possible in this
> specific case because the guest never calls .bdrv_save_vmstate().
> 
> But as I said, logging more requests than necessary doesn't really hurt
> expect making the log a bit larger.

Let's make this clear first.
Logging more request really hurts, because we may want to do some additional
things in record or replay modes.
E.g., we may save some snapshots (in record mode) to be used later.
Saving operations should not be replayed. But in replay mode we may load back
some of these snapshots and replay from that moment.
Or user will issue a command in monitor and it will read some disk information.

That's why I want to record only guest operations.

Pavel Dovgalyuk
Kevin Wolf Feb. 10, 2016, 1:52 p.m. UTC | #9
Am 10.02.2016 um 14:33 hat Pavel Dovgalyuk geschrieben:
> > From: Kevin Wolf [mailto:kwolf@redhat.com]
> > Am 10.02.2016 um 13:51 hat Pavel Dovgalyuk geschrieben:
> > > > From: Kevin Wolf [mailto:kwolf@redhat.com]
> > > > Am 10.02.2016 um 13:05 hat Pavel Dovgalyuk geschrieben:
> > > > > > Am 09.02.2016 um 12:52 hat Pavel Dovgalyuk geschrieben:
> > > > > > > > From: Kevin Wolf [mailto:kwolf@redhat.com]
> > > > > > > > But even this doesn't feel completely right, because block drivers are
> > > > > > > > already layered and there is no need to hardcode something optional (and
> > > > > > > > rarely used) in the hot code path that could just be another layer.
> > > > > > > >
> > > > > > > > I assume that you know beforehand if you want to replay something, so
> > > > > > > > requiring you to configure your block devices with a replay driver on
> > > > > > > > top of the stack seems reasonable enough.
> > > > > > >
> > > > > > > I cannot use block drivers for this. When driver functions are used, QEMU
> > > > > > > is already used coroutines (and probably started bottom halves).
> > > > > > > Coroutines make execution non-deterministic.
> > > > > > > That's why we have to intercept blk_aio_ functions, that are called
> > > > > > > deterministically.
> > > > > >
> > > > > > What does "deterministic" mean in this context, i.e. what are your exact
> > > > > > requirements?
> > > > >
> > > > > "Deterministic" means that the replayed execution should run exactly
> > > > > the same guest instructions in the same sequence, as in recording session.
> > > >
> > > > Okay. I think with this we can do better than what you have now.
> > > >
> > > > > > I don't think that coroutines introduce anything non-deterministic per
> > > > > > se. Depending on what you mean by it, the block layer code paths in
> > > > > > block.c may contain problematic code.
> > > > >
> > > > > They are non-deterministic if we need instruction-level accuracy.
> > > > > Thread switching (and therefore callbacks and BH execution) is non-deterministic.
> > > >
> > > > Thread switching depends on an external event (the kernel scheduler
> > > > deciding to switch), so agreed, if a thread switch ever influences what
> > > > the guest sees, that would be a problem.
> > > >
> > > > Generally, however, callbacks and BHs don't involve a thread switch at
> > > > all (BHs can be invoked from a different thread in theory, but we have
> > > > very few of those cases and they shouldn't be visible for the guest).
> > > > The same is true for coroutines, which are semantically equivalent to
> > > > callbacks.
> > > >
> > > > > In two different executions these callbacks may happen at different moments of
> > > > > time (counting in number of executed instructions).
> > > > > All operations with virtual devices (including memory, interrupt controller,
> > > > > and disk drive controller) should happen at deterministic moments of time
> > > > > to be replayable.
> > > >
> > > > Right, so let's talk about what this external non-deterministic event
> > > > really is.
> > > >
> > > > I think the only thing whose timing is unknown in the block layer is the
> > > > completion of I/O requests. This non-determinism comes from the time the
> > > > I/O syscalls made by the lowest layer (usually raw-posix) take.
> > >
> > > Right.
> > >
> > > > This means that we can add logic to remove the non-determinism at the
> > > > point of our choice between raw-posix and the guest device emulation. A
> > > > block driver on top is as good as anything else.
> > > >
> > > > While recording, this block driver would just pass the request to next
> > > > lower layer (starting a request is deterministic, so it doesn't need to
> > > > be logged) and once the request completes it logs it. While replaying,
> > > > the completion of requests is delayed until we read it in the log; if we
> > > > read it in the log and the request hasn't completed yet, we do a busy
> > > > wait for it (while(!completed) aio_poll();).
> > >
> > > I tried serializing all bottom halves and worker thread callbacks in
> > > previous version of the patches. That code was much more complicated
> > > and error-prone than the current version. We had to classify all bottom
> > > halves to recorded and non-recorded (because sometimes they are used
> > > for qemu's purposes, not the guest ones).
> > >
> > > However, I don't understand yet which layer do you offer as the candidate
> > > for record/replay? What functions should be changed?
> > > I would like to investigate this way, but I don't got it yet.
> > 
> > At the core, I wouldn't change any existing function, but introduce a
> > new block driver. You could copy raw_bsd.c for a start and then tweak
> > it. Leave out functions that you don't want to support, and add the
> > necessary magic to .bdrv_co_readv/writev.
> > 
> > Something like this (can probably be generalised for more than just
> > reads as the part after the bdrv_co_reads() call should be the same for
> > reads, writes and any other request types):
> > 
> > int blkreplay_co_readv()
> > {
> >     BlockReplayState *s = bs->opaque;
> >     int reqid = s->reqid++;
> > 
> >     bdrv_co_readv(bs->file, ...);
> > 
> >     if (mode == record) {
> >         log(reqid, time);
> >     } else {
> >         assert(mode == replay);
> >         bool *done = req_replayed_list_get(reqid)
> >         if (done) {
> >             *done = true;
> >         } else {
> >             req_completed_list_insert(reqid, qemu_coroutine_self());
> >             qemu_coroutine_yield();
> >         }
> >     }
> > }
> > 
> > /* called by replay.c */
> > int blkreplay_run_event()
> > {
> >     if (mode == replay) {
> >         co = req_completed_list_get(e.reqid);
> >         if (co) {
> >             qemu_coroutine_enter(co);
> >         } else {
> >             bool done = false;
> >             req_replayed_list_insert(reqid, &done);
> >             /* wait synchronously for completion */
> >             while (!done) {
> >                 aio_poll();
> >             }
> >         }
> >     }
> > }
> > 
> > Where we could consider changing existing code is that it might be
> > desirable to automatically put an instance of this block driver on top
> > of every block device when record/replay is used. If we don't do that,
> > you need to explicitly specify -drive driver=blkreplay,...
> 
> Thank you.
> 
> > 
> > > > This model would get rid of the bdrv_drain_all() that you call
> > > > everywhere and therefore allow concurrent requests, giving a result that
> > > > is much closer to the "normal" behaviour without replay.
> > > >
> > > > > > The block layer uses bottom halves in some cases for request completion,
> > > > > > but not until calling into the first driver (why would they be a
> > > > > > problem?). What could happen is that a request is serialised and
> > > > > > therefore delayed in some special configurations, which sounds a lot
> > > > > > like what you wanted to avoid.
> > > > >
> > > > > Drivers cannot distinguish the requests from guest CPU and from savevm/loadvm.
> > > > > First ones have to be deterministic, because they affect guest memory,
> > > > > virtual disk controller, and interrupts.
> > > >
> > > > Sure they can, these are two different callbacks. But even if they
> > > > couldn't, making more things than necessary deterministic might be
> > > > wasteful, but not really harmful.
> > >
> > > Is there any universal way to check this?
> > 
> > No, the block layer doesn't know its caller. It's only possible in this
> > specific case because the guest never calls .bdrv_save_vmstate().
> > 
> > But as I said, logging more requests than necessary doesn't really hurt
> > expect making the log a bit larger.
> 
> Let's make this clear first.
> Logging more request really hurts, because we may want to do some additional
> things in record or replay modes.
> E.g., we may save some snapshots (in record mode) to be used later.
> Saving operations should not be replayed. But in replay mode we may load back
> some of these snapshots and replay from that moment.
> Or user will issue a command in monitor and it will read some disk information.
> 
> That's why I want to record only guest operations.

Understood. As I said, for savevm/loadvm we are okay because they don't
use the same functions as guest I/O anyway.

There are a few more advanced block layer features (like block jobs or
the built-in NBD server) that produce requests even though they are not
in the guest. The solution for that is to target them at the layer below
the blkreplay driver, which is identifiable by its own node-name. This
is not possible in all cases yet, but we are actively working on
allowing this. I expect that using such features together with record
and replay is highly unusual anyway.

Kevin
Pavel Dovgalyuk Feb. 11, 2016, 6:05 a.m. UTC | #10
> From: Kevin Wolf [mailto:kwolf@redhat.com]
> Am 10.02.2016 um 13:51 hat Pavel Dovgalyuk geschrieben:
> > > From: Kevin Wolf [mailto:kwolf@redhat.com]
> > > Am 10.02.2016 um 13:05 hat Pavel Dovgalyuk geschrieben:
> > > > > Am 09.02.2016 um 12:52 hat Pavel Dovgalyuk geschrieben:
> > > > > > > From: Kevin Wolf [mailto:kwolf@redhat.com]
> > > > > > > But even this doesn't feel completely right, because block drivers are
> > > > > > > already layered and there is no need to hardcode something optional (and
> > > > > > > rarely used) in the hot code path that could just be another layer.
> > > > > > >
> > > > > > > I assume that you know beforehand if you want to replay something, so
> > > > > > > requiring you to configure your block devices with a replay driver on
> > > > > > > top of the stack seems reasonable enough.
> > > > > >
> > > > > > I cannot use block drivers for this. When driver functions are used, QEMU
> > > > > > is already used coroutines (and probably started bottom halves).
> > > > > > Coroutines make execution non-deterministic.
> > > > > > That's why we have to intercept blk_aio_ functions, that are called
> > > > > > deterministically.
> > > > >
> > > > > What does "deterministic" mean in this context, i.e. what are your exact
> > > > > requirements?
> > > >
> > > > "Deterministic" means that the replayed execution should run exactly
> > > > the same guest instructions in the same sequence, as in recording session.
> > >
> > > Okay. I think with this we can do better than what you have now.
> > >
> > > > > I don't think that coroutines introduce anything non-deterministic per
> > > > > se. Depending on what you mean by it, the block layer code paths in
> > > > > block.c may contain problematic code.
> > > >
> > > > They are non-deterministic if we need instruction-level accuracy.
> > > > Thread switching (and therefore callbacks and BH execution) is non-deterministic.
> > >
> > > Thread switching depends on an external event (the kernel scheduler
> > > deciding to switch), so agreed, if a thread switch ever influences what
> > > the guest sees, that would be a problem.
> > >
> > > Generally, however, callbacks and BHs don't involve a thread switch at
> > > all (BHs can be invoked from a different thread in theory, but we have
> > > very few of those cases and they shouldn't be visible for the guest).
> > > The same is true for coroutines, which are semantically equivalent to
> > > callbacks.
> > >
> > > > In two different executions these callbacks may happen at different moments of
> > > > time (counting in number of executed instructions).
> > > > All operations with virtual devices (including memory, interrupt controller,
> > > > and disk drive controller) should happen at deterministic moments of time
> > > > to be replayable.
> > >
> > > Right, so let's talk about what this external non-deterministic event
> > > really is.
> > >
> > > I think the only thing whose timing is unknown in the block layer is the
> > > completion of I/O requests. This non-determinism comes from the time the
> > > I/O syscalls made by the lowest layer (usually raw-posix) take.
> >
> > Right.
> >
> > > This means that we can add logic to remove the non-determinism at the
> > > point of our choice between raw-posix and the guest device emulation. A
> > > block driver on top is as good as anything else.
> > >
> > > While recording, this block driver would just pass the request to next
> > > lower layer (starting a request is deterministic, so it doesn't need to
> > > be logged) and once the request completes it logs it. While replaying,
> > > the completion of requests is delayed until we read it in the log; if we
> > > read it in the log and the request hasn't completed yet, we do a busy
> > > wait for it (while(!completed) aio_poll();).
> >
> > I tried serializing all bottom halves and worker thread callbacks in
> > previous version of the patches. That code was much more complicated
> > and error-prone than the current version. We had to classify all bottom
> > halves to recorded and non-recorded (because sometimes they are used
> > for qemu's purposes, not the guest ones).
> >
> > However, I don't understand yet which layer do you offer as the candidate
> > for record/replay? What functions should be changed?
> > I would like to investigate this way, but I don't got it yet.
> 
> At the core, I wouldn't change any existing function, but introduce a
> new block driver. You could copy raw_bsd.c for a start and then tweak
> it. Leave out functions that you don't want to support, and add the
> necessary magic to .bdrv_co_readv/writev.
> 
> Something like this (can probably be generalised for more than just
> reads as the part after the bdrv_co_reads() call should be the same for
> reads, writes and any other request types):
> 
> int blkreplay_co_readv()
> {
>     BlockReplayState *s = bs->opaque;
>     int reqid = s->reqid++;
> 
>     bdrv_co_readv(bs->file, ...);
> 
>     if (mode == record) {
>         log(reqid, time);
>     } else {
>         assert(mode == replay);
>         bool *done = req_replayed_list_get(reqid)
>         if (done) {
>             *done = true;
>         } else {
>             req_completed_list_insert(reqid, qemu_coroutine_self());
>             qemu_coroutine_yield();
>         }
>     }
> }
> 
> /* called by replay.c */
> int blkreplay_run_event()
> {
>     if (mode == replay) {
>         co = req_completed_list_get(e.reqid);
>         if (co) {
>             qemu_coroutine_enter(co);
>         } else {
>             bool done = false;
>             req_replayed_list_insert(reqid, &done);
>             /* wait synchronously for completion */
>             while (!done) {
>                 aio_poll();
>             }
>         }
>     }
> }
> 
> Where we could consider changing existing code is that it might be
> desirable to automatically put an instance of this block driver on top
> of every block device when record/replay is used. If we don't do that,
> you need to explicitly specify -drive driver=blkreplay,...

As far, as I understand, all synchronous read/write request are also passed
through this coroutines layer.
It means that every disk access in replay phase should match the recording phase.

Record/replay is intended to be used for debugging and analysis.
When execution is replayed, guest machine cannot notice analysis overhead.
Some analysis methods may include disk image reading. E.g., qemu-based
analysis framework DECAF uses sleuthkit for disk forensics ( https://github.com/sycurelab/DECAF ).
If similar framework will be used with replay, forensics disk access operations
won't work if we will record/replay the coroutines.

And if we'll record only high-level asynchronous disk operations, then
there will be a way to performs disk accesses without matching these
operations with replay log.

However, we already tried to record disk operations completion events
(in previous more complicated version). Recording overhead was the same
as with new blk_aio_ version of the patch. 

Pavel Dovgalyuk
Kevin Wolf Feb. 11, 2016, 9:43 a.m. UTC | #11
Am 11.02.2016 um 07:05 hat Pavel Dovgalyuk geschrieben:
> > From: Kevin Wolf [mailto:kwolf@redhat.com]
> > Am 10.02.2016 um 13:51 hat Pavel Dovgalyuk geschrieben:
> > > However, I don't understand yet which layer do you offer as the candidate
> > > for record/replay? What functions should be changed?
> > > I would like to investigate this way, but I don't got it yet.
> > 
> > At the core, I wouldn't change any existing function, but introduce a
> > new block driver. You could copy raw_bsd.c for a start and then tweak
> > it. Leave out functions that you don't want to support, and add the
> > necessary magic to .bdrv_co_readv/writev.
> > 
> > Something like this (can probably be generalised for more than just
> > reads as the part after the bdrv_co_reads() call should be the same for
> > reads, writes and any other request types):
> > 
> > int blkreplay_co_readv()
> > {
> >     BlockReplayState *s = bs->opaque;
> >     int reqid = s->reqid++;
> > 
> >     bdrv_co_readv(bs->file, ...);
> > 
> >     if (mode == record) {
> >         log(reqid, time);
> >     } else {
> >         assert(mode == replay);
> >         bool *done = req_replayed_list_get(reqid)
> >         if (done) {
> >             *done = true;
> >         } else {
> >             req_completed_list_insert(reqid, qemu_coroutine_self());
> >             qemu_coroutine_yield();
> >         }
> >     }
> > }
> > 
> > /* called by replay.c */
> > int blkreplay_run_event()
> > {
> >     if (mode == replay) {
> >         co = req_completed_list_get(e.reqid);
> >         if (co) {
> >             qemu_coroutine_enter(co);
> >         } else {
> >             bool done = false;
> >             req_replayed_list_insert(reqid, &done);
> >             /* wait synchronously for completion */
> >             while (!done) {
> >                 aio_poll();
> >             }
> >         }
> >     }
> > }
> > 
> > Where we could consider changing existing code is that it might be
> > desirable to automatically put an instance of this block driver on top
> > of every block device when record/replay is used. If we don't do that,
> > you need to explicitly specify -drive driver=blkreplay,...
> 
> As far, as I understand, all synchronous read/write request are also passed
> through this coroutines layer.

Yes, all read/write requests go through the same function internally, no
matter which external interface was used.

> It means that every disk access in replay phase should match the recording phase.

Right. If I'm not mistaken, this was the fundamental requirement you
have, so I wouldn't have suggested this otherwise.

> Record/replay is intended to be used for debugging and analysis.
> When execution is replayed, guest machine cannot notice analysis overhead.
> Some analysis methods may include disk image reading. E.g., qemu-based
> analysis framework DECAF uses sleuthkit for disk forensics ( https://github.com/sycurelab/DECAF ).
> If similar framework will be used with replay, forensics disk access operations
> won't work if we will record/replay the coroutines.

Sorry, I'm not sure if I can follow.

If such analysis software runs in the guest, it's not a replay any more
and I completely fail to see what you're doing.

If it's a qemu component independent from the guest, then my method
gives you a clean way to bypass the replay driver that wouldn't be
possible with yours.

If your plan was to record/replay only async requests and then use sync
requests to bypass the record/replay, let me clearly state that this is
the wrong approach: There are still guest devices which do synchronous
I/O and need to be considered in the replay log, and you shouldn't
prevent the analysis code from using AIO (in fact, using sync I/O in new
code is very much frowned upon).

I can explain in more detail what the block device structure looks like
and how to access an image with and without record/replay, but first let
me please know whether I guessed right what your problem is. Or if I
missed your point, can you please describe in detail a case that
wouldn't work?

> And if we'll record only high-level asynchronous disk operations, then
> there will be a way to performs disk accesses without matching these
> operations with replay log.
> 
> However, we already tried to record disk operations completion events
> (in previous more complicated version). Recording overhead was the same
> as with new blk_aio_ version of the patch. 

Well, if your code really intentionally ignores synchronous requests,
then the advantage you get from my approach is being correct.

Other than that, I'm personally mainly interested in keeping the core
block layer easily maintainable and the hot I/O paths clean for cases
that don't use all the features qemu has. This means that in the mid
term existing features need to move out of the core block layer rather
than new ones being added. So doing things in a separate block driver
might not be an advantage for your immediate needs, but it's important
for clean modularisation and a requirement for things to get merged
(unless you can prove that it's impossible to do this way, but I don't
think you can - otherwise the whole block layer design would be wrong).

Kevin
Pavel Dovgalyuk Feb. 11, 2016, 11 a.m. UTC | #12
> From: Kevin Wolf [mailto:kwolf@redhat.com]
> Am 11.02.2016 um 07:05 hat Pavel Dovgalyuk geschrieben:
> > > From: Kevin Wolf [mailto:kwolf@redhat.com]
> > > Am 10.02.2016 um 13:51 hat Pavel Dovgalyuk geschrieben:
> > > > However, I don't understand yet which layer do you offer as the candidate
> > > > for record/replay? What functions should be changed?
> > > > I would like to investigate this way, but I don't got it yet.
> > >
> > > At the core, I wouldn't change any existing function, but introduce a
> > > new block driver. You could copy raw_bsd.c for a start and then tweak
> > > it. Leave out functions that you don't want to support, and add the
> > > necessary magic to .bdrv_co_readv/writev.
> > >
> > > Something like this (can probably be generalised for more than just
> > > reads as the part after the bdrv_co_reads() call should be the same for
> > > reads, writes and any other request types):
> > >
> > > int blkreplay_co_readv()
> > > {
> > >     BlockReplayState *s = bs->opaque;
> > >     int reqid = s->reqid++;
> > >
> > >     bdrv_co_readv(bs->file, ...);
> > >
> > >     if (mode == record) {
> > >         log(reqid, time);
> > >     } else {
> > >         assert(mode == replay);
> > >         bool *done = req_replayed_list_get(reqid)
> > >         if (done) {
> > >             *done = true;
> > >         } else {
> > >             req_completed_list_insert(reqid, qemu_coroutine_self());
> > >             qemu_coroutine_yield();
> > >         }
> > >     }
> > > }
> > >
> > > /* called by replay.c */
> > > int blkreplay_run_event()
> > > {
> > >     if (mode == replay) {
> > >         co = req_completed_list_get(e.reqid);
> > >         if (co) {
> > >             qemu_coroutine_enter(co);
> > >         } else {
> > >             bool done = false;
> > >             req_replayed_list_insert(reqid, &done);
> > >             /* wait synchronously for completion */
> > >             while (!done) {
> > >                 aio_poll();
> > >             }
> > >         }
> > >     }
> > > }
> > >
> > > Where we could consider changing existing code is that it might be
> > > desirable to automatically put an instance of this block driver on top
> > > of every block device when record/replay is used. If we don't do that,
> > > you need to explicitly specify -drive driver=blkreplay,...
> >
> > As far, as I understand, all synchronous read/write request are also passed
> > through this coroutines layer.
> 
> Yes, all read/write requests go through the same function internally, no
> matter which external interface was used.
> 
> > It means that every disk access in replay phase should match the recording phase.
> 
> Right. If I'm not mistaken, this was the fundamental requirement you
> have, so I wouldn't have suggested this otherwise.
> 
> > Record/replay is intended to be used for debugging and analysis.
> > When execution is replayed, guest machine cannot notice analysis overhead.
> > Some analysis methods may include disk image reading. E.g., qemu-based
> > analysis framework DECAF uses sleuthkit for disk forensics (
> https://github.com/sycurelab/DECAF ).
> > If similar framework will be used with replay, forensics disk access operations
> > won't work if we will record/replay the coroutines.
> 
> Sorry, I'm not sure if I can follow.
> 
> If such analysis software runs in the guest, it's not a replay any more
> and I completely fail to see what you're doing.
> 
> If it's a qemu component independent from the guest, then my method
> gives you a clean way to bypass the replay driver that wouldn't be
> possible with yours.

The second one. qemu may be extended with some components that
perform guest introspection.

> If your plan was to record/replay only async requests and then use sync
> requests to bypass the record/replay, let me clearly state that this is
> the wrong approach: There are still guest devices which do synchronous
> I/O and need to be considered in the replay log, and you shouldn't
> prevent the analysis code from using AIO (in fact, using sync I/O in new
> code is very much frowned upon).

Why do guest synchronous requests have to be recorded?
Aren't they completely deterministic?

> I can explain in more detail what the block device structure looks like
> and how to access an image with and without record/replay, but first let
> me please know whether I guessed right what your problem is. Or if I
> missed your point, can you please describe in detail a case that
> wouldn't work?

You have understood it correctly.
And what is the solution for bypassing one of the layers from component that
should not affect the replay?

> > And if we'll record only high-level asynchronous disk operations, then
> > there will be a way to performs disk accesses without matching these
> > operations with replay log.
> >
> > However, we already tried to record disk operations completion events
> > (in previous more complicated version). Recording overhead was the same
> > as with new blk_aio_ version of the patch.
> 
> Well, if your code really intentionally ignores synchronous requests,
> then the advantage you get from my approach is being correct.
> 
> Other than that, I'm personally mainly interested in keeping the core
> block layer easily maintainable and the hot I/O paths clean for cases
> that don't use all the features qemu has. This means that in the mid
> term existing features need to move out of the core block layer rather
> than new ones being added. So doing things in a separate block driver
> might not be an advantage for your immediate needs, but it's important
> for clean modularisation and a requirement for things to get merged
> (unless you can prove that it's impossible to do this way, but I don't
> think you can - otherwise the whole block layer design would be wrong).

Ok, let's try to find the solution with a new driver.

Pavel Dovgalyuk
Kevin Wolf Feb. 11, 2016, 12:18 p.m. UTC | #13
Am 11.02.2016 um 12:00 hat Pavel Dovgalyuk geschrieben:
> > From: Kevin Wolf [mailto:kwolf@redhat.com]
> > Am 11.02.2016 um 07:05 hat Pavel Dovgalyuk geschrieben:
> > > > From: Kevin Wolf [mailto:kwolf@redhat.com]
> > > > Am 10.02.2016 um 13:51 hat Pavel Dovgalyuk geschrieben:
> > > > > However, I don't understand yet which layer do you offer as the candidate
> > > > > for record/replay? What functions should be changed?
> > > > > I would like to investigate this way, but I don't got it yet.
> > > >
> > > > At the core, I wouldn't change any existing function, but introduce a
> > > > new block driver. You could copy raw_bsd.c for a start and then tweak
> > > > it. Leave out functions that you don't want to support, and add the
> > > > necessary magic to .bdrv_co_readv/writev.
> > > >
> > > > Something like this (can probably be generalised for more than just
> > > > reads as the part after the bdrv_co_reads() call should be the same for
> > > > reads, writes and any other request types):
> > > >
> > > > int blkreplay_co_readv()
> > > > {
> > > >     BlockReplayState *s = bs->opaque;
> > > >     int reqid = s->reqid++;
> > > >
> > > >     bdrv_co_readv(bs->file, ...);
> > > >
> > > >     if (mode == record) {
> > > >         log(reqid, time);
> > > >     } else {
> > > >         assert(mode == replay);
> > > >         bool *done = req_replayed_list_get(reqid)
> > > >         if (done) {
> > > >             *done = true;
> > > >         } else {
> > > >             req_completed_list_insert(reqid, qemu_coroutine_self());
> > > >             qemu_coroutine_yield();
> > > >         }
> > > >     }
> > > > }
> > > >
> > > > /* called by replay.c */
> > > > int blkreplay_run_event()
> > > > {
> > > >     if (mode == replay) {
> > > >         co = req_completed_list_get(e.reqid);
> > > >         if (co) {
> > > >             qemu_coroutine_enter(co);
> > > >         } else {
> > > >             bool done = false;
> > > >             req_replayed_list_insert(reqid, &done);
> > > >             /* wait synchronously for completion */
> > > >             while (!done) {
> > > >                 aio_poll();
> > > >             }
> > > >         }
> > > >     }
> > > > }
> > > >
> > > > Where we could consider changing existing code is that it might be
> > > > desirable to automatically put an instance of this block driver on top
> > > > of every block device when record/replay is used. If we don't do that,
> > > > you need to explicitly specify -drive driver=blkreplay,...
> > >
> > > As far, as I understand, all synchronous read/write request are also passed
> > > through this coroutines layer.
> > 
> > Yes, all read/write requests go through the same function internally, no
> > matter which external interface was used.
> > 
> > > It means that every disk access in replay phase should match the recording phase.
> > 
> > Right. If I'm not mistaken, this was the fundamental requirement you
> > have, so I wouldn't have suggested this otherwise.
> > 
> > > Record/replay is intended to be used for debugging and analysis.
> > > When execution is replayed, guest machine cannot notice analysis overhead.
> > > Some analysis methods may include disk image reading. E.g., qemu-based
> > > analysis framework DECAF uses sleuthkit for disk forensics (
> > https://github.com/sycurelab/DECAF ).
> > > If similar framework will be used with replay, forensics disk access operations
> > > won't work if we will record/replay the coroutines.
> > 
> > Sorry, I'm not sure if I can follow.
> > 
> > If such analysis software runs in the guest, it's not a replay any more
> > and I completely fail to see what you're doing.
> > 
> > If it's a qemu component independent from the guest, then my method
> > gives you a clean way to bypass the replay driver that wouldn't be
> > possible with yours.
> 
> The second one. qemu may be extended with some components that
> perform guest introspection.
> 
> > If your plan was to record/replay only async requests and then use sync
> > requests to bypass the record/replay, let me clearly state that this is
> > the wrong approach: There are still guest devices which do synchronous
> > I/O and need to be considered in the replay log, and you shouldn't
> > prevent the analysis code from using AIO (in fact, using sync I/O in new
> > code is very much frowned upon).
> 
> Why do guest synchronous requests have to be recorded?
> Aren't they completely deterministic?

Good point. I think you're right in practice. In theory, with dataplane
(i.e. when running the request in a separate thread) it could happen,
but I guess that isn't very compatible with replay anyway - and at the
first sight I couldn't see it performing synchronous requests either.

> > I can explain in more detail what the block device structure looks like
> > and how to access an image with and without record/replay, but first let
> > me please know whether I guessed right what your problem is. Or if I
> > missed your point, can you please describe in detail a case that
> > wouldn't work?
> 
> You have understood it correctly.
> And what is the solution for bypassing one of the layers from component that
> should not affect the replay?

For this, you need to understand how block drivers are stacked in qemu.
Each driver in the stack has a separate struct BlockDriverState, which
can be used to access its data. You could hook up things like this:

      virtio-blk              NBD server
    --------------           ------------
          |                        |
          v                        |
    +------------+                 |
    | blkreplay  |                 |
    +------------+                 |
          |                        |
          v                        |
    +------------+                 |
    |   qcow2    | <---------------+
    +------------+
          |
          v
    +------------+
    | raw-posix  |
    +------------+
          |
          v
    --------------
      filesystem

As you see, what I've chosen for the external analysis interface is just
an NBD server as this is the component that we already have today. You
could hook up any other (new) code there; the important part is that it
doesn't work on the BDS of the blkreplay driver, but directly on the BDS
of the qcow2 driver.

On the command line, it could look like this (this assumes that we don't
add syntactic sugar that creates the blkreplay part automatically - we
can always do that):

    -drive file=image.qcow2,if=none,id=img-direct
    -drive driver=blkreplay,if=none,image=img-direct,id=img-blkreplay
    -device virtio-blk-pci,drive=img-blkreplay

The NBD server can't be started on the command line, so you'd go to the
monitor and start it there with the direct access:

    (qemu) nbd_server_start unix:/tmp/my_socket
    (qemu) nbd_server_add img-direct

(Exact syntax is untested, but this should roughly be how it works.)

Kevin
Pavel Dovgalyuk Feb. 11, 2016, 12:24 p.m. UTC | #14
> From: Kevin Wolf [mailto:kwolf@redhat.com]
> Am 11.02.2016 um 12:00 hat Pavel Dovgalyuk geschrieben:
> > > From: Kevin Wolf [mailto:kwolf@redhat.com]
> > > Am 11.02.2016 um 07:05 hat Pavel Dovgalyuk geschrieben:
> > > > > From: Kevin Wolf [mailto:kwolf@redhat.com]
> > > > > Am 10.02.2016 um 13:51 hat Pavel Dovgalyuk geschrieben:
> > > > > > However, I don't understand yet which layer do you offer as the candidate
> > > > > > for record/replay? What functions should be changed?
> > > > > > I would like to investigate this way, but I don't got it yet.
> > > > >
> > > > > At the core, I wouldn't change any existing function, but introduce a
> > > > > new block driver. You could copy raw_bsd.c for a start and then tweak
> > > > > it. Leave out functions that you don't want to support, and add the
> > > > > necessary magic to .bdrv_co_readv/writev.
> > > > >
> > > > > Something like this (can probably be generalised for more than just
> > > > > reads as the part after the bdrv_co_reads() call should be the same for
> > > > > reads, writes and any other request types):
> > > > >
> > > > > int blkreplay_co_readv()
> > > > > {
> > > > >     BlockReplayState *s = bs->opaque;
> > > > >     int reqid = s->reqid++;
> > > > >
> > > > >     bdrv_co_readv(bs->file, ...);
> > > > >
> > > > >     if (mode == record) {
> > > > >         log(reqid, time);
> > > > >     } else {
> > > > >         assert(mode == replay);
> > > > >         bool *done = req_replayed_list_get(reqid)
> > > > >         if (done) {
> > > > >             *done = true;
> > > > >         } else {
> > > > >             req_completed_list_insert(reqid, qemu_coroutine_self());
> > > > >             qemu_coroutine_yield();
> > > > >         }
> > > > >     }
> > > > > }
> > > > >
> > > > > /* called by replay.c */
> > > > > int blkreplay_run_event()
> > > > > {
> > > > >     if (mode == replay) {
> > > > >         co = req_completed_list_get(e.reqid);
> > > > >         if (co) {
> > > > >             qemu_coroutine_enter(co);
> > > > >         } else {
> > > > >             bool done = false;
> > > > >             req_replayed_list_insert(reqid, &done);
> > > > >             /* wait synchronously for completion */
> > > > >             while (!done) {
> > > > >                 aio_poll();
> > > > >             }
> > > > >         }
> > > > >     }
> > > > > }
> > > > >
> > > > > Where we could consider changing existing code is that it might be
> > > > > desirable to automatically put an instance of this block driver on top
> > > > > of every block device when record/replay is used. If we don't do that,
> > > > > you need to explicitly specify -drive driver=blkreplay,...
> > > >
> > > > As far, as I understand, all synchronous read/write request are also passed
> > > > through this coroutines layer.
> > >
> > > Yes, all read/write requests go through the same function internally, no
> > > matter which external interface was used.
> > >
> > > > It means that every disk access in replay phase should match the recording phase.
> > >
> > > Right. If I'm not mistaken, this was the fundamental requirement you
> > > have, so I wouldn't have suggested this otherwise.
> > >
> > > > Record/replay is intended to be used for debugging and analysis.
> > > > When execution is replayed, guest machine cannot notice analysis overhead.
> > > > Some analysis methods may include disk image reading. E.g., qemu-based
> > > > analysis framework DECAF uses sleuthkit for disk forensics (
> > > https://github.com/sycurelab/DECAF ).
> > > > If similar framework will be used with replay, forensics disk access operations
> > > > won't work if we will record/replay the coroutines.
> > >
> > > Sorry, I'm not sure if I can follow.
> > >
> > > If such analysis software runs in the guest, it's not a replay any more
> > > and I completely fail to see what you're doing.
> > >
> > > If it's a qemu component independent from the guest, then my method
> > > gives you a clean way to bypass the replay driver that wouldn't be
> > > possible with yours.
> >
> > The second one. qemu may be extended with some components that
> > perform guest introspection.
> >
> > > If your plan was to record/replay only async requests and then use sync
> > > requests to bypass the record/replay, let me clearly state that this is
> > > the wrong approach: There are still guest devices which do synchronous
> > > I/O and need to be considered in the replay log, and you shouldn't
> > > prevent the analysis code from using AIO (in fact, using sync I/O in new
> > > code is very much frowned upon).
> >
> > Why do guest synchronous requests have to be recorded?
> > Aren't they completely deterministic?
> 
> Good point. I think you're right in practice. In theory, with dataplane
> (i.e. when running the request in a separate thread) it could happen,
> but I guess that isn't very compatible with replay anyway - and at the
> first sight I couldn't see it performing synchronous requests either.
> 
> > > I can explain in more detail what the block device structure looks like
> > > and how to access an image with and without record/replay, but first let
> > > me please know whether I guessed right what your problem is. Or if I
> > > missed your point, can you please describe in detail a case that
> > > wouldn't work?
> >
> > You have understood it correctly.
> > And what is the solution for bypassing one of the layers from component that
> > should not affect the replay?
> 
> For this, you need to understand how block drivers are stacked in qemu.
> Each driver in the stack has a separate struct BlockDriverState, which
> can be used to access its data. You could hook up things like this:
> 
>       virtio-blk              NBD server
>     --------------           ------------
>           |                        |
>           v                        |
>     +------------+                 |
>     | blkreplay  |                 |
>     +------------+                 |
>           |                        |
>           v                        |
>     +------------+                 |
>     |   qcow2    | <---------------+
>     +------------+
>           |
>           v
>     +------------+
>     | raw-posix  |
>     +------------+
>           |
>           v
>     --------------
>       filesystem
> 
> As you see, what I've chosen for the external analysis interface is just
> an NBD server as this is the component that we already have today. You
> could hook up any other (new) code there; the important part is that it
> doesn't work on the BDS of the blkreplay driver, but directly on the BDS
> of the qcow2 driver.
> 
> On the command line, it could look like this (this assumes that we don't
> add syntactic sugar that creates the blkreplay part automatically - we
> can always do that):
> 
>     -drive file=image.qcow2,if=none,id=img-direct
>     -drive driver=blkreplay,if=none,image=img-direct,id=img-blkreplay
>     -device virtio-blk-pci,drive=img-blkreplay
> 
> The NBD server can't be started on the command line, so you'd go to the
> monitor and start it there with the direct access:
> 
>     (qemu) nbd_server_start unix:/tmp/my_socket
>     (qemu) nbd_server_add img-direct
> 
> (Exact syntax is untested, but this should roughly be how it works.)

Thank you!

I'll try this approach and come back either with patches or new questions.

Pavel Dovgalyuk
Pavel Dovgalyuk Feb. 12, 2016, 8:33 a.m. UTC | #15
> From: Kevin Wolf [mailto:kwolf@redhat.com]
> Am 11.02.2016 um 12:00 hat Pavel Dovgalyuk geschrieben:
> > > From: Kevin Wolf [mailto:kwolf@redhat.com]
> > > Am 11.02.2016 um 07:05 hat Pavel Dovgalyuk geschrieben:
> > > > > From: Kevin Wolf [mailto:kwolf@redhat.com]
> > > > > Am 10.02.2016 um 13:51 hat Pavel Dovgalyuk geschrieben:
> > > > > > However, I don't understand yet which layer do you offer as the candidate
> > > > > > for record/replay? What functions should be changed?
> > > > > > I would like to investigate this way, but I don't got it yet.
> > > > >
> > > > > At the core, I wouldn't change any existing function, but introduce a
> > > > > new block driver. You could copy raw_bsd.c for a start and then tweak
> > > > > it. Leave out functions that you don't want to support, and add the
> > > > > necessary magic to .bdrv_co_readv/writev.
> > > > >
> > > > > Something like this (can probably be generalised for more than just
> > > > > reads as the part after the bdrv_co_reads() call should be the same for
> > > > > reads, writes and any other request types):
> > > > >
> > > > > int blkreplay_co_readv()
> > > > > {
> > > > >     BlockReplayState *s = bs->opaque;
> > > > >     int reqid = s->reqid++;
> > > > >
> > > > >     bdrv_co_readv(bs->file, ...);
> > > > >
> > > > >     if (mode == record) {
> > > > >         log(reqid, time);
> > > > >     } else {
> > > > >         assert(mode == replay);
> > > > >         bool *done = req_replayed_list_get(reqid)
> > > > >         if (done) {
> > > > >             *done = true;
> > > > >         } else {
> > > > >             req_completed_list_insert(reqid, qemu_coroutine_self());
> > > > >             qemu_coroutine_yield();
> > > > >         }
> > > > >     }
> > > > > }
> > > > >
> > > > > /* called by replay.c */
> > > > > int blkreplay_run_event()
> > > > > {
> > > > >     if (mode == replay) {
> > > > >         co = req_completed_list_get(e.reqid);
> > > > >         if (co) {
> > > > >             qemu_coroutine_enter(co);
> > > > >         } else {
> > > > >             bool done = false;
> > > > >             req_replayed_list_insert(reqid, &done);
> > > > >             /* wait synchronously for completion */
> > > > >             while (!done) {
> > > > >                 aio_poll();
> > > > >             }
> > > > >         }
> > > > >     }
> > > > > }
> > > > >
> > > > > Where we could consider changing existing code is that it might be
> > > > > desirable to automatically put an instance of this block driver on top
> > > > > of every block device when record/replay is used. If we don't do that,
> > > > > you need to explicitly specify -drive driver=blkreplay,...
> > > >
> > > > As far, as I understand, all synchronous read/write request are also passed
> > > > through this coroutines layer.
> > >
> > > Yes, all read/write requests go through the same function internally, no
> > > matter which external interface was used.
> > >
> > > > It means that every disk access in replay phase should match the recording phase.
> > >
> > > Right. If I'm not mistaken, this was the fundamental requirement you
> > > have, so I wouldn't have suggested this otherwise.
> > >
> > > > Record/replay is intended to be used for debugging and analysis.
> > > > When execution is replayed, guest machine cannot notice analysis overhead.
> > > > Some analysis methods may include disk image reading. E.g., qemu-based
> > > > analysis framework DECAF uses sleuthkit for disk forensics (
> > > https://github.com/sycurelab/DECAF ).
> > > > If similar framework will be used with replay, forensics disk access operations
> > > > won't work if we will record/replay the coroutines.
> > >
> > > Sorry, I'm not sure if I can follow.
> > >
> > > If such analysis software runs in the guest, it's not a replay any more
> > > and I completely fail to see what you're doing.
> > >
> > > If it's a qemu component independent from the guest, then my method
> > > gives you a clean way to bypass the replay driver that wouldn't be
> > > possible with yours.
> >
> > The second one. qemu may be extended with some components that
> > perform guest introspection.
> >
> > > If your plan was to record/replay only async requests and then use sync
> > > requests to bypass the record/replay, let me clearly state that this is
> > > the wrong approach: There are still guest devices which do synchronous
> > > I/O and need to be considered in the replay log, and you shouldn't
> > > prevent the analysis code from using AIO (in fact, using sync I/O in new
> > > code is very much frowned upon).
> >
> > Why do guest synchronous requests have to be recorded?
> > Aren't they completely deterministic?
> 
> Good point. I think you're right in practice. In theory, with dataplane
> (i.e. when running the request in a separate thread) it could happen,
> but I guess that isn't very compatible with replay anyway - and at the
> first sight I couldn't see it performing synchronous requests either.
> 
> > > I can explain in more detail what the block device structure looks like
> > > and how to access an image with and without record/replay, but first let
> > > me please know whether I guessed right what your problem is. Or if I
> > > missed your point, can you please describe in detail a case that
> > > wouldn't work?
> >
> > You have understood it correctly.
> > And what is the solution for bypassing one of the layers from component that
> > should not affect the replay?
> 
> For this, you need to understand how block drivers are stacked in qemu.
> Each driver in the stack has a separate struct BlockDriverState, which
> can be used to access its data. You could hook up things like this:
> 
>       virtio-blk              NBD server
>     --------------           ------------
>           |                        |
>           v                        |
>     +------------+                 |
>     | blkreplay  |                 |
>     +------------+                 |
>           |                        |
>           v                        |
>     +------------+                 |
>     |   qcow2    | <---------------+
>     +------------+
>           |
>           v
>     +------------+
>     | raw-posix  |
>     +------------+
>           |
>           v
>     --------------
>       filesystem
> 
> As you see, what I've chosen for the external analysis interface is just
> an NBD server as this is the component that we already have today. You
> could hook up any other (new) code there; the important part is that it
> doesn't work on the BDS of the blkreplay driver, but directly on the BDS
> of the qcow2 driver.
> 
> On the command line, it could look like this (this assumes that we don't
> add syntactic sugar that creates the blkreplay part automatically - we
> can always do that):
> 
>     -drive file=image.qcow2,if=none,id=img-direct
>     -drive driver=blkreplay,if=none,image=img-direct,id=img-blkreplay
>     -device virtio-blk-pci,drive=img-blkreplay

Are there any hints for driver with these options?
I can't figure out how to create _open function for that.
blkdebug driver seems similar, but it receives image name directly, without referencing
the lower level.


Pavel Dovgalyuk
Kevin Wolf Feb. 12, 2016, 9:44 a.m. UTC | #16
Am 12.02.2016 um 09:33 hat Pavel Dovgalyuk geschrieben:
> > From: Kevin Wolf [mailto:kwolf@redhat.com]
> > Am 11.02.2016 um 12:00 hat Pavel Dovgalyuk geschrieben:
> > > > From: Kevin Wolf [mailto:kwolf@redhat.com]
> > > > Am 11.02.2016 um 07:05 hat Pavel Dovgalyuk geschrieben:
> > > > > > From: Kevin Wolf [mailto:kwolf@redhat.com]
> > > > > > Am 10.02.2016 um 13:51 hat Pavel Dovgalyuk geschrieben:
> > > > > > > However, I don't understand yet which layer do you offer as the candidate
> > > > > > > for record/replay? What functions should be changed?
> > > > > > > I would like to investigate this way, but I don't got it yet.
> > > > > >
> > > > > > At the core, I wouldn't change any existing function, but introduce a
> > > > > > new block driver. You could copy raw_bsd.c for a start and then tweak
> > > > > > it. Leave out functions that you don't want to support, and add the
> > > > > > necessary magic to .bdrv_co_readv/writev.
> > > > > >
> > > > > > Something like this (can probably be generalised for more than just
> > > > > > reads as the part after the bdrv_co_reads() call should be the same for
> > > > > > reads, writes and any other request types):
> > > > > >
> > > > > > int blkreplay_co_readv()
> > > > > > {
> > > > > >     BlockReplayState *s = bs->opaque;
> > > > > >     int reqid = s->reqid++;
> > > > > >
> > > > > >     bdrv_co_readv(bs->file, ...);
> > > > > >
> > > > > >     if (mode == record) {
> > > > > >         log(reqid, time);
> > > > > >     } else {
> > > > > >         assert(mode == replay);
> > > > > >         bool *done = req_replayed_list_get(reqid)
> > > > > >         if (done) {
> > > > > >             *done = true;
> > > > > >         } else {
> > > > > >             req_completed_list_insert(reqid, qemu_coroutine_self());
> > > > > >             qemu_coroutine_yield();
> > > > > >         }
> > > > > >     }
> > > > > > }
> > > > > >
> > > > > > /* called by replay.c */
> > > > > > int blkreplay_run_event()
> > > > > > {
> > > > > >     if (mode == replay) {
> > > > > >         co = req_completed_list_get(e.reqid);
> > > > > >         if (co) {
> > > > > >             qemu_coroutine_enter(co);
> > > > > >         } else {
> > > > > >             bool done = false;
> > > > > >             req_replayed_list_insert(reqid, &done);
> > > > > >             /* wait synchronously for completion */
> > > > > >             while (!done) {
> > > > > >                 aio_poll();
> > > > > >             }
> > > > > >         }
> > > > > >     }
> > > > > > }
> > > > > >
> > > > > > Where we could consider changing existing code is that it might be
> > > > > > desirable to automatically put an instance of this block driver on top
> > > > > > of every block device when record/replay is used. If we don't do that,
> > > > > > you need to explicitly specify -drive driver=blkreplay,...
> > > > >
> > > > > As far, as I understand, all synchronous read/write request are also passed
> > > > > through this coroutines layer.
> > > >
> > > > Yes, all read/write requests go through the same function internally, no
> > > > matter which external interface was used.
> > > >
> > > > > It means that every disk access in replay phase should match the recording phase.
> > > >
> > > > Right. If I'm not mistaken, this was the fundamental requirement you
> > > > have, so I wouldn't have suggested this otherwise.
> > > >
> > > > > Record/replay is intended to be used for debugging and analysis.
> > > > > When execution is replayed, guest machine cannot notice analysis overhead.
> > > > > Some analysis methods may include disk image reading. E.g., qemu-based
> > > > > analysis framework DECAF uses sleuthkit for disk forensics (
> > > > https://github.com/sycurelab/DECAF ).
> > > > > If similar framework will be used with replay, forensics disk access operations
> > > > > won't work if we will record/replay the coroutines.
> > > >
> > > > Sorry, I'm not sure if I can follow.
> > > >
> > > > If such analysis software runs in the guest, it's not a replay any more
> > > > and I completely fail to see what you're doing.
> > > >
> > > > If it's a qemu component independent from the guest, then my method
> > > > gives you a clean way to bypass the replay driver that wouldn't be
> > > > possible with yours.
> > >
> > > The second one. qemu may be extended with some components that
> > > perform guest introspection.
> > >
> > > > If your plan was to record/replay only async requests and then use sync
> > > > requests to bypass the record/replay, let me clearly state that this is
> > > > the wrong approach: There are still guest devices which do synchronous
> > > > I/O and need to be considered in the replay log, and you shouldn't
> > > > prevent the analysis code from using AIO (in fact, using sync I/O in new
> > > > code is very much frowned upon).
> > >
> > > Why do guest synchronous requests have to be recorded?
> > > Aren't they completely deterministic?
> > 
> > Good point. I think you're right in practice. In theory, with dataplane
> > (i.e. when running the request in a separate thread) it could happen,
> > but I guess that isn't very compatible with replay anyway - and at the
> > first sight I couldn't see it performing synchronous requests either.
> > 
> > > > I can explain in more detail what the block device structure looks like
> > > > and how to access an image with and without record/replay, but first let
> > > > me please know whether I guessed right what your problem is. Or if I
> > > > missed your point, can you please describe in detail a case that
> > > > wouldn't work?
> > >
> > > You have understood it correctly.
> > > And what is the solution for bypassing one of the layers from component that
> > > should not affect the replay?
> > 
> > For this, you need to understand how block drivers are stacked in qemu.
> > Each driver in the stack has a separate struct BlockDriverState, which
> > can be used to access its data. You could hook up things like this:
> > 
> >       virtio-blk              NBD server
> >     --------------           ------------
> >           |                        |
> >           v                        |
> >     +------------+                 |
> >     | blkreplay  |                 |
> >     +------------+                 |
> >           |                        |
> >           v                        |
> >     +------------+                 |
> >     |   qcow2    | <---------------+
> >     +------------+
> >           |
> >           v
> >     +------------+
> >     | raw-posix  |
> >     +------------+
> >           |
> >           v
> >     --------------
> >       filesystem
> > 
> > As you see, what I've chosen for the external analysis interface is just
> > an NBD server as this is the component that we already have today. You
> > could hook up any other (new) code there; the important part is that it
> > doesn't work on the BDS of the blkreplay driver, but directly on the BDS
> > of the qcow2 driver.
> > 
> > On the command line, it could look like this (this assumes that we don't
> > add syntactic sugar that creates the blkreplay part automatically - we
> > can always do that):
> > 
> >     -drive file=image.qcow2,if=none,id=img-direct
> >     -drive driver=blkreplay,if=none,image=img-direct,id=img-blkreplay
> >     -device virtio-blk-pci,drive=img-blkreplay
> 
> Are there any hints for driver with these options?
> I can't figure out how to create _open function for that.
> blkdebug driver seems similar, but it receives image name directly, without referencing
> the lower level.

Actually, the 'image' option of blkdebug works like this. For example, I
just tried this commandline and it works:

    -drive file=/tmp/test.qcow2,if=none,id=img
    -drive driver=blkdebug,image=img,if=virtio

Just ignore the legacy part with 'x-image' and pass NULL as the filename
to bdrv_open_child().

Another driver that could be useful as an example is quorum.

Kevin
Pavel Dovgalyuk Feb. 12, 2016, 1:19 p.m. UTC | #17
> From: Kevin Wolf [mailto:kwolf@redhat.com]
> Am 10.02.2016 um 13:51 hat Pavel Dovgalyuk geschrieben:
> > > From: Kevin Wolf [mailto:kwolf@redhat.com]
> > > Am 10.02.2016 um 13:05 hat Pavel Dovgalyuk geschrieben:
> > > > > Am 09.02.2016 um 12:52 hat Pavel Dovgalyuk geschrieben:
> > > > > > > From: Kevin Wolf [mailto:kwolf@redhat.com]
> > > > > > > But even this doesn't feel completely right, because block drivers are
> > > > > > > already layered and there is no need to hardcode something optional (and
> > > > > > > rarely used) in the hot code path that could just be another layer.
> > > > > > >
> > > > > > > I assume that you know beforehand if you want to replay something, so
> > > > > > > requiring you to configure your block devices with a replay driver on
> > > > > > > top of the stack seems reasonable enough.
> > > > > >
> > > > > > I cannot use block drivers for this. When driver functions are used, QEMU
> > > > > > is already used coroutines (and probably started bottom halves).
> > > > > > Coroutines make execution non-deterministic.
> > > > > > That's why we have to intercept blk_aio_ functions, that are called
> > > > > > deterministically.
> > > > >
> > > > > What does "deterministic" mean in this context, i.e. what are your exact
> > > > > requirements?
> > > >
> > > > "Deterministic" means that the replayed execution should run exactly
> > > > the same guest instructions in the same sequence, as in recording session.
> > >
> > > Okay. I think with this we can do better than what you have now.
> > >
> > > > > I don't think that coroutines introduce anything non-deterministic per
> > > > > se. Depending on what you mean by it, the block layer code paths in
> > > > > block.c may contain problematic code.
> > > >
> > > > They are non-deterministic if we need instruction-level accuracy.
> > > > Thread switching (and therefore callbacks and BH execution) is non-deterministic.
> > >
> > > Thread switching depends on an external event (the kernel scheduler
> > > deciding to switch), so agreed, if a thread switch ever influences what
> > > the guest sees, that would be a problem.
> > >
> > > Generally, however, callbacks and BHs don't involve a thread switch at
> > > all (BHs can be invoked from a different thread in theory, but we have
> > > very few of those cases and they shouldn't be visible for the guest).
> > > The same is true for coroutines, which are semantically equivalent to
> > > callbacks.
> > >
> > > > In two different executions these callbacks may happen at different moments of
> > > > time (counting in number of executed instructions).
> > > > All operations with virtual devices (including memory, interrupt controller,
> > > > and disk drive controller) should happen at deterministic moments of time
> > > > to be replayable.
> > >
> > > Right, so let's talk about what this external non-deterministic event
> > > really is.
> > >
> > > I think the only thing whose timing is unknown in the block layer is the
> > > completion of I/O requests. This non-determinism comes from the time the
> > > I/O syscalls made by the lowest layer (usually raw-posix) take.
> >
> > Right.
> >
> > > This means that we can add logic to remove the non-determinism at the
> > > point of our choice between raw-posix and the guest device emulation. A
> > > block driver on top is as good as anything else.
> > >
> > > While recording, this block driver would just pass the request to next
> > > lower layer (starting a request is deterministic, so it doesn't need to
> > > be logged) and once the request completes it logs it. While replaying,
> > > the completion of requests is delayed until we read it in the log; if we
> > > read it in the log and the request hasn't completed yet, we do a busy
> > > wait for it (while(!completed) aio_poll();).
> >
> > I tried serializing all bottom halves and worker thread callbacks in
> > previous version of the patches. That code was much more complicated
> > and error-prone than the current version. We had to classify all bottom
> > halves to recorded and non-recorded (because sometimes they are used
> > for qemu's purposes, not the guest ones).
> >
> > However, I don't understand yet which layer do you offer as the candidate
> > for record/replay? What functions should be changed?
> > I would like to investigate this way, but I don't got it yet.
> 
> At the core, I wouldn't change any existing function, but introduce a
> new block driver. You could copy raw_bsd.c for a start and then tweak
> it. Leave out functions that you don't want to support, and add the
> necessary magic to .bdrv_co_readv/writev.
> 
> Something like this (can probably be generalised for more than just
> reads as the part after the bdrv_co_reads() call should be the same for
> reads, writes and any other request types):
> 
> int blkreplay_co_readv()
> {
>     BlockReplayState *s = bs->opaque;
>     int reqid = s->reqid++;
> 
>     bdrv_co_readv(bs->file, ...);
> 
>     if (mode == record) {
>         log(reqid, time);
>     } else {
>         assert(mode == replay);
>         bool *done = req_replayed_list_get(reqid)
>         if (done) {
>             *done = true;
>         } else {
point A
>             req_completed_list_insert(reqid, qemu_coroutine_self());
>             qemu_coroutine_yield();
>         }
>     }
> }
> 
> /* called by replay.c */
> int blkreplay_run_event()
> {
>     if (mode == replay) {
>         co = req_completed_list_get(e.reqid);
>         if (co) {
>             qemu_coroutine_enter(co);
>         } else {
>             bool done = false;
>             req_replayed_list_insert(reqid, &done);
point B
>             /* wait synchronously for completion */
>             while (!done) {
>                 aio_poll();
>             }
>         }
>     }
> }

One more question about coroutines.
Are race conditions possible in this sample?
In replay mode we may call readv, and reach point A.
On the same time, we will read point B in another thread.
Then readv will yield and nobody will start it back?

Pavel Dovgalyuk
Kevin Wolf Feb. 12, 2016, 1:58 p.m. UTC | #18
Am 12.02.2016 um 14:19 hat Pavel Dovgalyuk geschrieben:
> > From: Kevin Wolf [mailto:kwolf@redhat.com]
> > Am 10.02.2016 um 13:51 hat Pavel Dovgalyuk geschrieben:
> > > > From: Kevin Wolf [mailto:kwolf@redhat.com]
> > > > Am 10.02.2016 um 13:05 hat Pavel Dovgalyuk geschrieben:
> > > > > > Am 09.02.2016 um 12:52 hat Pavel Dovgalyuk geschrieben:
> > > > > > > > From: Kevin Wolf [mailto:kwolf@redhat.com]
> > > > > > > > But even this doesn't feel completely right, because block drivers are
> > > > > > > > already layered and there is no need to hardcode something optional (and
> > > > > > > > rarely used) in the hot code path that could just be another layer.
> > > > > > > >
> > > > > > > > I assume that you know beforehand if you want to replay something, so
> > > > > > > > requiring you to configure your block devices with a replay driver on
> > > > > > > > top of the stack seems reasonable enough.
> > > > > > >
> > > > > > > I cannot use block drivers for this. When driver functions are used, QEMU
> > > > > > > is already used coroutines (and probably started bottom halves).
> > > > > > > Coroutines make execution non-deterministic.
> > > > > > > That's why we have to intercept blk_aio_ functions, that are called
> > > > > > > deterministically.
> > > > > >
> > > > > > What does "deterministic" mean in this context, i.e. what are your exact
> > > > > > requirements?
> > > > >
> > > > > "Deterministic" means that the replayed execution should run exactly
> > > > > the same guest instructions in the same sequence, as in recording session.
> > > >
> > > > Okay. I think with this we can do better than what you have now.
> > > >
> > > > > > I don't think that coroutines introduce anything non-deterministic per
> > > > > > se. Depending on what you mean by it, the block layer code paths in
> > > > > > block.c may contain problematic code.
> > > > >
> > > > > They are non-deterministic if we need instruction-level accuracy.
> > > > > Thread switching (and therefore callbacks and BH execution) is non-deterministic.
> > > >
> > > > Thread switching depends on an external event (the kernel scheduler
> > > > deciding to switch), so agreed, if a thread switch ever influences what
> > > > the guest sees, that would be a problem.
> > > >
> > > > Generally, however, callbacks and BHs don't involve a thread switch at
> > > > all (BHs can be invoked from a different thread in theory, but we have
> > > > very few of those cases and they shouldn't be visible for the guest).
> > > > The same is true for coroutines, which are semantically equivalent to
> > > > callbacks.
> > > >
> > > > > In two different executions these callbacks may happen at different moments of
> > > > > time (counting in number of executed instructions).
> > > > > All operations with virtual devices (including memory, interrupt controller,
> > > > > and disk drive controller) should happen at deterministic moments of time
> > > > > to be replayable.
> > > >
> > > > Right, so let's talk about what this external non-deterministic event
> > > > really is.
> > > >
> > > > I think the only thing whose timing is unknown in the block layer is the
> > > > completion of I/O requests. This non-determinism comes from the time the
> > > > I/O syscalls made by the lowest layer (usually raw-posix) take.
> > >
> > > Right.
> > >
> > > > This means that we can add logic to remove the non-determinism at the
> > > > point of our choice between raw-posix and the guest device emulation. A
> > > > block driver on top is as good as anything else.
> > > >
> > > > While recording, this block driver would just pass the request to next
> > > > lower layer (starting a request is deterministic, so it doesn't need to
> > > > be logged) and once the request completes it logs it. While replaying,
> > > > the completion of requests is delayed until we read it in the log; if we
> > > > read it in the log and the request hasn't completed yet, we do a busy
> > > > wait for it (while(!completed) aio_poll();).
> > >
> > > I tried serializing all bottom halves and worker thread callbacks in
> > > previous version of the patches. That code was much more complicated
> > > and error-prone than the current version. We had to classify all bottom
> > > halves to recorded and non-recorded (because sometimes they are used
> > > for qemu's purposes, not the guest ones).
> > >
> > > However, I don't understand yet which layer do you offer as the candidate
> > > for record/replay? What functions should be changed?
> > > I would like to investigate this way, but I don't got it yet.
> > 
> > At the core, I wouldn't change any existing function, but introduce a
> > new block driver. You could copy raw_bsd.c for a start and then tweak
> > it. Leave out functions that you don't want to support, and add the
> > necessary magic to .bdrv_co_readv/writev.
> > 
> > Something like this (can probably be generalised for more than just
> > reads as the part after the bdrv_co_reads() call should be the same for
> > reads, writes and any other request types):
> > 
> > int blkreplay_co_readv()
> > {
> >     BlockReplayState *s = bs->opaque;
> >     int reqid = s->reqid++;
> > 
> >     bdrv_co_readv(bs->file, ...);
> > 
> >     if (mode == record) {
> >         log(reqid, time);
> >     } else {
> >         assert(mode == replay);
> >         bool *done = req_replayed_list_get(reqid)
> >         if (done) {
> >             *done = true;
> >         } else {
> point A
> >             req_completed_list_insert(reqid, qemu_coroutine_self());
> >             qemu_coroutine_yield();
> >         }
> >     }
> > }
> > 
> > /* called by replay.c */
> > int blkreplay_run_event()
> > {
> >     if (mode == replay) {
> >         co = req_completed_list_get(e.reqid);
> >         if (co) {
> >             qemu_coroutine_enter(co);
> >         } else {
> >             bool done = false;
> >             req_replayed_list_insert(reqid, &done);
> point B
> >             /* wait synchronously for completion */
> >             while (!done) {
> >                 aio_poll();
> >             }
> >         }
> >     }
> > }
> 
> One more question about coroutines.
> Are race conditions possible in this sample?
> In replay mode we may call readv, and reach point A.
> On the same time, we will read point B in another thread.
> Then readv will yield and nobody will start it back?

There are two aspects to this:

* Real multithreading doesn't exist in the block layer. All block driver
  functions are only called with the mutex in the AioContext held. There
  is exactly one AioContext per BDS, so no two threads can possible be
  operating on the same BDS at the same time.

* Coroutines are different from threads in that they aren't preemptive.
  They are only interrupted in places where they explicitly yield.

Of course, in order for this to work, we actually need to take the mutex
before calling blkreplay_run_event(), which is called directly from the
replay code (which runs in the mainloop thread? Or vcpu?).

So I think you need to have a aio_context_acquire(bs->aio_context) and
aio_context_release(bs->aio_context) around the function; either here or
in the calling replay code.

Kevin
Pavel Dovgalyuk Feb. 15, 2016, 8:38 a.m. UTC | #19
> From: Kevin Wolf [mailto:kwolf@redhat.com]
> > >
> > > int blkreplay_co_readv()
> > > {
> > >     BlockReplayState *s = bs->opaque;
> > >     int reqid = s->reqid++;
> > >
> > >     bdrv_co_readv(bs->file, ...);
> > >
> > >     if (mode == record) {
> > >         log(reqid, time);
> > >     } else {
> > >         assert(mode == replay);
> > >         bool *done = req_replayed_list_get(reqid)
> > >         if (done) {
> > >             *done = true;
> > >         } else {
> > point A
> > >             req_completed_list_insert(reqid, qemu_coroutine_self());
> > >             qemu_coroutine_yield();
> > >         }
> > >     }
> > > }
> > >
> > > /* called by replay.c */
> > > int blkreplay_run_event()
> > > {
> > >     if (mode == replay) {
> > >         co = req_completed_list_get(e.reqid);
> > >         if (co) {
> > >             qemu_coroutine_enter(co);
> > >         } else {
> > >             bool done = false;
> > >             req_replayed_list_insert(reqid, &done);
> > point B
> > >             /* wait synchronously for completion */
> > >             while (!done) {
> > >                 aio_poll();
> > >             }
> > >         }
> > >     }
> > > }
> >
> > One more question about coroutines.
> > Are race conditions possible in this sample?
> > In replay mode we may call readv, and reach point A.
> > On the same time, we will read point B in another thread.
> > Then readv will yield and nobody will start it back?
> 
> There are two aspects to this:
> 
> * Real multithreading doesn't exist in the block layer. All block driver
>   functions are only called with the mutex in the AioContext held. There
>   is exactly one AioContext per BDS, so no two threads can possible be
>   operating on the same BDS at the same time.
> 
> * Coroutines are different from threads in that they aren't preemptive.
>   They are only interrupted in places where they explicitly yield.
> 
> Of course, in order for this to work, we actually need to take the mutex
> before calling blkreplay_run_event(), which is called directly from the
> replay code (which runs in the mainloop thread? Or vcpu?).

blkreplay_run_event() is called from replay code which is protected by mutex.
This function may be called from io and vcpu threads, because both of them
have replay functions invocations.

> So I think you need to have a aio_context_acquire(bs->aio_context) and
> aio_context_release(bs->aio_context) around the function; either here or
> in the calling replay code.

And what about coroutine code? Does it call aio_context_acquire somewhere?

Pavel Dovgalyuk
Kevin Wolf Feb. 15, 2016, 9:10 a.m. UTC | #20
Am 15.02.2016 um 09:38 hat Pavel Dovgalyuk geschrieben:
> > From: Kevin Wolf [mailto:kwolf@redhat.com]
> > > >
> > > > int blkreplay_co_readv()
> > > > {
> > > >     BlockReplayState *s = bs->opaque;
> > > >     int reqid = s->reqid++;
> > > >
> > > >     bdrv_co_readv(bs->file, ...);
> > > >
> > > >     if (mode == record) {
> > > >         log(reqid, time);
> > > >     } else {
> > > >         assert(mode == replay);
> > > >         bool *done = req_replayed_list_get(reqid)
> > > >         if (done) {
> > > >             *done = true;
> > > >         } else {
> > > point A
> > > >             req_completed_list_insert(reqid, qemu_coroutine_self());
> > > >             qemu_coroutine_yield();
> > > >         }
> > > >     }
> > > > }
> > > >
> > > > /* called by replay.c */
> > > > int blkreplay_run_event()
> > > > {
> > > >     if (mode == replay) {
> > > >         co = req_completed_list_get(e.reqid);
> > > >         if (co) {
> > > >             qemu_coroutine_enter(co);
> > > >         } else {
> > > >             bool done = false;
> > > >             req_replayed_list_insert(reqid, &done);
> > > point B
> > > >             /* wait synchronously for completion */
> > > >             while (!done) {
> > > >                 aio_poll();
> > > >             }
> > > >         }
> > > >     }
> > > > }
> > >
> > > One more question about coroutines.
> > > Are race conditions possible in this sample?
> > > In replay mode we may call readv, and reach point A.
> > > On the same time, we will read point B in another thread.
> > > Then readv will yield and nobody will start it back?
> > 
> > There are two aspects to this:
> > 
> > * Real multithreading doesn't exist in the block layer. All block driver
> >   functions are only called with the mutex in the AioContext held. There
> >   is exactly one AioContext per BDS, so no two threads can possible be
> >   operating on the same BDS at the same time.
> > 
> > * Coroutines are different from threads in that they aren't preemptive.
> >   They are only interrupted in places where they explicitly yield.
> > 
> > Of course, in order for this to work, we actually need to take the mutex
> > before calling blkreplay_run_event(), which is called directly from the
> > replay code (which runs in the mainloop thread? Or vcpu?).
> 
> blkreplay_run_event() is called from replay code which is protected by mutex.
> This function may be called from io and vcpu threads, because both of them
> have replay functions invocations.

I assume that the mutex you mean is the BQL. This is enough for cases
which use the main loop AioContext, but not for block devices with
dataplane threads.

> > So I think you need to have a aio_context_acquire(bs->aio_context) and
> > aio_context_release(bs->aio_context) around the function; either here or
> > in the calling replay code.
> 
> And what about coroutine code? Does it call aio_context_acquire somewhere?

The coroutine code doesn't call aio_context_acquire() by itself. But
coroutines may only be entered when the appropriate AioContext lock is
held, so they are always protected.

Kevin
Pavel Dovgalyuk Feb. 15, 2016, 9:14 a.m. UTC | #21
> From: Pavel Dovgalyuk [mailto:dovgaluk@ispras.ru]
> > From: Kevin Wolf [mailto:kwolf@redhat.com]
> > > >
> > > > int blkreplay_co_readv()
> > > > {
> > > >     BlockReplayState *s = bs->opaque;
> > > >     int reqid = s->reqid++;
> > > >
> > > >     bdrv_co_readv(bs->file, ...);
> > > >
> > > >     if (mode == record) {
> > > >         log(reqid, time);
> > > >     } else {
> > > >         assert(mode == replay);
> > > >         bool *done = req_replayed_list_get(reqid)
> > > >         if (done) {
> > > >             *done = true;
> > > >         } else {
> > > point A
> > > >             req_completed_list_insert(reqid, qemu_coroutine_self());
> > > >             qemu_coroutine_yield();
> > > >         }
> > > >     }
> > > > }
> > > >
> > > > /* called by replay.c */
> > > > int blkreplay_run_event()
> > > > {
> > > >     if (mode == replay) {
> > > >         co = req_completed_list_get(e.reqid);
> > > >         if (co) {
> > > >             qemu_coroutine_enter(co);
> > > >         } else {
> > > >             bool done = false;
> > > >             req_replayed_list_insert(reqid, &done);
> > > point B
> > > >             /* wait synchronously for completion */
> > > >             while (!done) {
> > > >                 aio_poll();
> > > >             }
> > > >         }
> > > >     }
> > > > }
> > >
> > > One more question about coroutines.
> > > Are race conditions possible in this sample?
> > > In replay mode we may call readv, and reach point A.
> > > On the same time, we will read point B in another thread.
> > > Then readv will yield and nobody will start it back?
> >
> > There are two aspects to this:
> >
> > * Real multithreading doesn't exist in the block layer. All block driver
> >   functions are only called with the mutex in the AioContext held. There
> >   is exactly one AioContext per BDS, so no two threads can possible be
> >   operating on the same BDS at the same time.
> >
> > * Coroutines are different from threads in that they aren't preemptive.
> >   They are only interrupted in places where they explicitly yield.
> >
> > Of course, in order for this to work, we actually need to take the mutex
> > before calling blkreplay_run_event(), which is called directly from the
> > replay code (which runs in the mainloop thread? Or vcpu?).
> 
> blkreplay_run_event() is called from replay code which is protected by mutex.
> This function may be called from io and vcpu threads, because both of them
> have replay functions invocations.

Now I've encountered a situation where blkreplay_run_event is called from read coroutine:
bdrv_prwv_co -> aio_poll -> qemu_clock_get_ns -> replay_read_clock -> blkreplay_run_event
           \--> bdrv_co_readv -> blkreplay_co_readv -> bdrv_co_readv(lower layer)

bdrv_co_readv inside blkreplay_co_readv can't proceed in this situation.
This is probably because aio_poll has taken the aio context?
How can I resolve this?

Pavel Dovgalyuk
Kevin Wolf Feb. 15, 2016, 9:38 a.m. UTC | #22
Am 15.02.2016 um 10:14 hat Pavel Dovgalyuk geschrieben:
> > From: Pavel Dovgalyuk [mailto:dovgaluk@ispras.ru]
> > > From: Kevin Wolf [mailto:kwolf@redhat.com]
> > > > >
> > > > > int blkreplay_co_readv()
> > > > > {
> > > > >     BlockReplayState *s = bs->opaque;
> > > > >     int reqid = s->reqid++;
> > > > >
> > > > >     bdrv_co_readv(bs->file, ...);
> > > > >
> > > > >     if (mode == record) {
> > > > >         log(reqid, time);
> > > > >     } else {
> > > > >         assert(mode == replay);
> > > > >         bool *done = req_replayed_list_get(reqid)
> > > > >         if (done) {
> > > > >             *done = true;
> > > > >         } else {
> > > > point A
> > > > >             req_completed_list_insert(reqid, qemu_coroutine_self());
> > > > >             qemu_coroutine_yield();
> > > > >         }
> > > > >     }
> > > > > }
> > > > >
> > > > > /* called by replay.c */
> > > > > int blkreplay_run_event()
> > > > > {
> > > > >     if (mode == replay) {
> > > > >         co = req_completed_list_get(e.reqid);
> > > > >         if (co) {
> > > > >             qemu_coroutine_enter(co);
> > > > >         } else {
> > > > >             bool done = false;
> > > > >             req_replayed_list_insert(reqid, &done);
> > > > point B
> > > > >             /* wait synchronously for completion */
> > > > >             while (!done) {
> > > > >                 aio_poll();
> > > > >             }
> > > > >         }
> > > > >     }
> > > > > }
> > > >
> > > > One more question about coroutines.
> > > > Are race conditions possible in this sample?
> > > > In replay mode we may call readv, and reach point A.
> > > > On the same time, we will read point B in another thread.
> > > > Then readv will yield and nobody will start it back?
> > >
> > > There are two aspects to this:
> > >
> > > * Real multithreading doesn't exist in the block layer. All block driver
> > >   functions are only called with the mutex in the AioContext held. There
> > >   is exactly one AioContext per BDS, so no two threads can possible be
> > >   operating on the same BDS at the same time.
> > >
> > > * Coroutines are different from threads in that they aren't preemptive.
> > >   They are only interrupted in places where they explicitly yield.
> > >
> > > Of course, in order for this to work, we actually need to take the mutex
> > > before calling blkreplay_run_event(), which is called directly from the
> > > replay code (which runs in the mainloop thread? Or vcpu?).
> > 
> > blkreplay_run_event() is called from replay code which is protected by mutex.
> > This function may be called from io and vcpu threads, because both of them
> > have replay functions invocations.
> 
> Now I've encountered a situation where blkreplay_run_event is called from read coroutine:
> bdrv_prwv_co -> aio_poll -> qemu_clock_get_ns -> replay_read_clock -> blkreplay_run_event
>            \--> bdrv_co_readv -> blkreplay_co_readv -> bdrv_co_readv(lower layer)
> 
> bdrv_co_readv inside blkreplay_co_readv can't proceed in this situation.
> This is probably because aio_poll has taken the aio context?
> How can I resolve this?

First of all, I'm not sure if running replay events from
qemu_clock_get_ns() is such a great idea. This is not a function that
callers expect to change any state. If you absolutely have to do it
there instead of in the clock device emulations, maybe restricting it to
replaying clock events could make it a bit more harmless.

Anyway, what does "can't proceed" mean? The coroutine yields because
it's waiting for I/O, but it is never reentered? Or is it hanging while
trying to acquire a lock?

Calling the callbacks that reenter a yielded coroutine is generally the
job of aio_poll(). After reentering the coroutine, blkreplay_run_event()
should return back to its caller and therefore indirectly to aio_poll(),
which should drive the events. Sounds like it should be working.

Can you provide more detail about the exact place where it's hanging,
both in the coroutine and in the main "coroutine" that executes
aio_poll()?

Kevin
Pavel Dovgalyuk Feb. 15, 2016, 11:19 a.m. UTC | #23
> From: Kevin Wolf [mailto:kwolf@redhat.com]
> Am 15.02.2016 um 10:14 hat Pavel Dovgalyuk geschrieben:
> > > From: Pavel Dovgalyuk [mailto:dovgaluk@ispras.ru]
> > > > From: Kevin Wolf [mailto:kwolf@redhat.com]
> > > > > >
> > > > > > int blkreplay_co_readv()
> > > > > > {	
> > > > > >     BlockReplayState *s = bs->opaque;
> > > > > >     int reqid = s->reqid++;
> > > > > >
> > > > > >     bdrv_co_readv(bs->file, ...);
> > > > > >
> > > > > >     if (mode == record) {
> > > > > >         log(reqid, time);
> > > > > >     } else {
> > > > > >         assert(mode == replay);
> > > > > >         bool *done = req_replayed_list_get(reqid)
> > > > > >         if (done) {
> > > > > >             *done = true;
> > > > > >         } else {
> > > > > point A
> > > > > >             req_completed_list_insert(reqid, qemu_coroutine_self());
> > > > > >             qemu_coroutine_yield();
> > > > > >         }
> > > > > >     }
> > > > > > }
> > > > > >
> > > > > > /* called by replay.c */
> > > > > > int blkreplay_run_event()
> > > > > > {
> > > > > >     if (mode == replay) {
> > > > > >         co = req_completed_list_get(e.reqid);
> > > > > >         if (co) {
> > > > > >             qemu_coroutine_enter(co);
> > > > > >         } else {
> > > > > >             bool done = false;
> > > > > >             req_replayed_list_insert(reqid, &done);
> > > > > point B
> > > > > >             /* wait synchronously for completion */
> > > > > >             while (!done) {
> > > > > >                 aio_poll();
> > > > > >             }
> > > > > >         }
> > > > > >     }
> > > > > > }
> > > >
> > Now I've encountered a situation where blkreplay_run_event is called from read coroutine:
> > bdrv_prwv_co -> aio_poll -> qemu_clock_get_ns -> replay_read_clock -> blkreplay_run_event
> >            \--> bdrv_co_readv -> blkreplay_co_readv -> bdrv_co_readv(lower layer)
> >
> > bdrv_co_readv inside blkreplay_co_readv can't proceed in this situation.
> > This is probably because aio_poll has taken the aio context?
> > How can I resolve this?
> 
> First of all, I'm not sure if running replay events from
> qemu_clock_get_ns() is such a great idea. This is not a function that
> callers expect to change any state. If you absolutely have to do it
> there instead of in the clock device emulations, maybe restricting it to
> replaying clock events could make it a bit more harmless.

qemu_clock_get_ns() wants to read some clock data from the log.
While reading, it finds block driver event and tries to proceed it.
These block events may occur at any moment, because blkreplay_co_readv()
writes them immediately as executes.

Alternative approach is adding these events to the queue and executing them
when checkpoint will be met. I'm not sure that this is easy to implement with
coroutines.

> Anyway, what does "can't proceed" mean? The coroutine yields because
> it's waiting for I/O, but it is never reentered? Or is it hanging while
> trying to acquire a lock?

bdrv_co_io_em() (raw file read/write) yields and waits inifinitely to return, because
aio_poll hands in some clock read.

> Calling the callbacks that reenter a yielded coroutine is generally the
> job of aio_poll(). After reentering the coroutine, blkreplay_run_event()
> should return back to its caller and therefore indirectly to aio_poll(),
> which should drive the events. Sounds like it should be working.

Yield occurred inside blkreplay_co_readv()->bdrv_co_readv() before
the request was added to the queue.


Pavel Dovgalyuk
Kevin Wolf Feb. 15, 2016, 12:46 p.m. UTC | #24
Am 15.02.2016 um 12:19 hat Pavel Dovgalyuk geschrieben:
> > From: Kevin Wolf [mailto:kwolf@redhat.com]
> > Am 15.02.2016 um 10:14 hat Pavel Dovgalyuk geschrieben:
> > > > From: Pavel Dovgalyuk [mailto:dovgaluk@ispras.ru]
> > > > > From: Kevin Wolf [mailto:kwolf@redhat.com]
> > > > > > >
> > > > > > > int blkreplay_co_readv()
> > > > > > > {	
> > > > > > >     BlockReplayState *s = bs->opaque;
> > > > > > >     int reqid = s->reqid++;
> > > > > > >
> > > > > > >     bdrv_co_readv(bs->file, ...);
> > > > > > >
> > > > > > >     if (mode == record) {
> > > > > > >         log(reqid, time);
> > > > > > >     } else {
> > > > > > >         assert(mode == replay);
> > > > > > >         bool *done = req_replayed_list_get(reqid)
> > > > > > >         if (done) {
> > > > > > >             *done = true;
> > > > > > >         } else {
> > > > > > point A
> > > > > > >             req_completed_list_insert(reqid, qemu_coroutine_self());
> > > > > > >             qemu_coroutine_yield();
> > > > > > >         }
> > > > > > >     }
> > > > > > > }
> > > > > > >
> > > > > > > /* called by replay.c */
> > > > > > > int blkreplay_run_event()
> > > > > > > {
> > > > > > >     if (mode == replay) {
> > > > > > >         co = req_completed_list_get(e.reqid);
> > > > > > >         if (co) {
> > > > > > >             qemu_coroutine_enter(co);
> > > > > > >         } else {
> > > > > > >             bool done = false;
> > > > > > >             req_replayed_list_insert(reqid, &done);
> > > > > > point B
> > > > > > >             /* wait synchronously for completion */
> > > > > > >             while (!done) {
> > > > > > >                 aio_poll();
> > > > > > >             }
> > > > > > >         }
> > > > > > >     }
> > > > > > > }
> > > > >
> > > Now I've encountered a situation where blkreplay_run_event is called from read coroutine:
> > > bdrv_prwv_co -> aio_poll -> qemu_clock_get_ns -> replay_read_clock -> blkreplay_run_event
> > >            \--> bdrv_co_readv -> blkreplay_co_readv -> bdrv_co_readv(lower layer)
> > >
> > > bdrv_co_readv inside blkreplay_co_readv can't proceed in this situation.
> > > This is probably because aio_poll has taken the aio context?
> > > How can I resolve this?
> > 
> > First of all, I'm not sure if running replay events from
> > qemu_clock_get_ns() is such a great idea. This is not a function that
> > callers expect to change any state. If you absolutely have to do it
> > there instead of in the clock device emulations, maybe restricting it to
> > replaying clock events could make it a bit more harmless.
> 
> qemu_clock_get_ns() wants to read some clock data from the log.
> While reading, it finds block driver event and tries to proceed it.
> These block events may occur at any moment, because blkreplay_co_readv()
> writes them immediately as executes.

I understand what's happening. I just think it's asking for bugs when
you do this in a low-level function like qemu_clock_get_ns() that nobody
expects to have any side effects. But we'll see how many bugs this will
cause.

> Alternative approach is adding these events to the queue and executing them
> when checkpoint will be met. I'm not sure that this is easy to implement with
> coroutines.

Shouldn't be hard, you're queueing requests already. But it's unlikely
to be the right solution.

Why does qemu_clock_get_ns() need to replay events to begin with?
Shouldn't it just return the current state without replaying anything
new?

> > Anyway, what does "can't proceed" mean? The coroutine yields because
> > it's waiting for I/O, but it is never reentered? Or is it hanging while
> > trying to acquire a lock?
> 
> bdrv_co_io_em() (raw file read/write) yields and waits inifinitely to return, because
> aio_poll hands in some clock read.

Okay, so the coroutine is sleeping while it waits for I/O to complete.
What's missing is the I/O completion callback.

When the coroutine yields in bdrv_co_io_em(), the caller coroutines
continues execution at blkreplay_run_event(). My naive expectation would
be that it returns to replay_read_clock(), which in turn returns to
qemu_clock_get_ns(), which knows the right time now and returns to
aio_poll(), which would process I/O completions and call the callback.
The callback would reenter the other coroutine in bdrv_co_io_em() and
the request would be completed.

As the callback isn't happening, something in this unwinding process
don't work as expected and we're actually hanging somewhere else. Do you
know where that is?

Kevin
Pavel Dovgalyuk Feb. 15, 2016, 1:54 p.m. UTC | #25
> From: Kevin Wolf [mailto:kwolf@redhat.com]
> Am 15.02.2016 um 10:14 hat Pavel Dovgalyuk geschrieben:
> > > From: Pavel Dovgalyuk [mailto:dovgaluk@ispras.ru]
> > > > From: Kevin Wolf [mailto:kwolf@redhat.com]
> > > > > >
> > > > > > int blkreplay_co_readv()
> > > > > > {
> > > > > >     BlockReplayState *s = bs->opaque;
> > > > > >     int reqid = s->reqid++;
> > > > > >
> > > > > >     bdrv_co_readv(bs->file, ...);
> > > > > >
> > > > > >     if (mode == record) {
> > > > > >         log(reqid, time);
> > > > > >     } else {
> > > > > >         assert(mode == replay);
> > > > > >         bool *done = req_replayed_list_get(reqid)
> > > > > >         if (done) {
> > > > > >             *done = true;
> > > > > >         } else {
> > > > > point A
> > > > > >             req_completed_list_insert(reqid, qemu_coroutine_self());
> > > > > >             qemu_coroutine_yield();
> > > > > >         }
> > > > > >     }
> > > > > > }
> > > > > >
> > > > > > /* called by replay.c */
> > > > > > int blkreplay_run_event()
> > > > > > {
> > > > > >     if (mode == replay) {
> > > > > >         co = req_completed_list_get(e.reqid);
> > > > > >         if (co) {
> > > > > >             qemu_coroutine_enter(co);
> > > > > >         } else {
> > > > > >             bool done = false;
> > > > > >             req_replayed_list_insert(reqid, &done);
> > > > > point B
> > > > > >             /* wait synchronously for completion */
> > > > > >             while (!done) {
> > > > > >                 aio_poll();
> > > > > >             }
> > > > > >         }
> > > > > >     }
> > > > > > }
> > > > >
> > > > > One more question about coroutines.
> > > > > Are race conditions possible in this sample?
> > > > > In replay mode we may call readv, and reach point A.
> > > > > On the same time, we will read point B in another thread.
> > > > > Then readv will yield and nobody will start it back?
> > > >
> > > > There are two aspects to this:
> > > >
> > > > * Real multithreading doesn't exist in the block layer. All block driver
> > > >   functions are only called with the mutex in the AioContext held. There
> > > >   is exactly one AioContext per BDS, so no two threads can possible be
> > > >   operating on the same BDS at the same time.
> > > >
> > > > * Coroutines are different from threads in that they aren't preemptive.
> > > >   They are only interrupted in places where they explicitly yield.
> > > >
> > > > Of course, in order for this to work, we actually need to take the mutex
> > > > before calling blkreplay_run_event(), which is called directly from the
> > > > replay code (which runs in the mainloop thread? Or vcpu?).
> > >
> > > blkreplay_run_event() is called from replay code which is protected by mutex.
> > > This function may be called from io and vcpu threads, because both of them
> > > have replay functions invocations.
> >
> > Now I've encountered a situation where blkreplay_run_event is called from read coroutine:
> > bdrv_prwv_co -> aio_poll -> qemu_clock_get_ns -> replay_read_clock -> blkreplay_run_event
> >            \--> bdrv_co_readv -> blkreplay_co_readv -> bdrv_co_readv(lower layer)
> >
> > bdrv_co_readv inside blkreplay_co_readv can't proceed in this situation.
> > This is probably because aio_poll has taken the aio context?
> > How can I resolve this?
> 
> First of all, I'm not sure if running replay events from
> qemu_clock_get_ns() is such a great idea. This is not a function that
> callers expect to change any state. If you absolutely have to do it
> there instead of in the clock device emulations, maybe restricting it to
> replaying clock events could make it a bit more harmless.

Only virtual clock is emulated, and host clock is read from the host
real time sources and therefore has to be saved into the log.

There could be asynchronous events that occur in non-cpu threads.
For now these events are shutdown request and block task execution.
They may "hide" following clock (or another one) events. That is why
we process them until synchronous event (like clock, instructions 
execution, or checkpoint) is met.


> Anyway, what does "can't proceed" mean? The coroutine yields because
> it's waiting for I/O, but it is never reentered? Or is it hanging while
> trying to acquire a lock?

I've solved this problem by slightly modifying the queue.
I haven't yet made BlockDriverState assignment to the request ids.
Therefore aio_poll was temporarily replaced with usleep.
Now execution starts and hangs at some random moment of OS loading.

Here is the current version of blkreplay functions:

static int coroutine_fn blkreplay_co_readv(BlockDriverState *bs,
    int64_t sector_num, int nb_sectors, QEMUIOVector *qiov)
{
    uint32_t reqid = request_id++;
    Request *req;
    req = block_request_insert(reqid, bs, qemu_coroutine_self());
    bdrv_co_readv(bs->file->bs, sector_num, nb_sectors, qiov);

    if (replay_mode == REPLAY_MODE_RECORD) {
        replay_save_block_event(reqid);
    } else {
        assert(replay_mode == REPLAY_MODE_PLAY);
        qemu_coroutine_yield();
    }
    block_request_remove(req);
    
    return 0;
}

void replay_run_block_event(uint32_t id)
{
    Request *req;
    if (replay_mode == REPLAY_MODE_PLAY) {
        while (!(req = block_request_find(id))) {
            //aio_poll(bdrv_get_aio_context(req->bs), true);
            usleep(1);
        }
        qemu_coroutine_enter(req->co, NULL);
    }
}

> Can you provide more detail about the exact place where it's hanging,
> both in the coroutine and in the main "coroutine" that executes
> aio_poll()?

In this version replay_run_block_event() executes while loop.
I haven't found what other threads do, because the debugger doesn't show me
call stack when thread is waiting in some blocking function.

Pavel Dovgalyuk
Kevin Wolf Feb. 15, 2016, 2:06 p.m. UTC | #26
Am 15.02.2016 um 14:54 hat Pavel Dovgalyuk geschrieben:
> > From: Kevin Wolf [mailto:kwolf@redhat.com]
> > Am 15.02.2016 um 10:14 hat Pavel Dovgalyuk geschrieben:
> > > > From: Pavel Dovgalyuk [mailto:dovgaluk@ispras.ru]
> > > > > From: Kevin Wolf [mailto:kwolf@redhat.com]
> > > > > > >
> > > > > > > int blkreplay_co_readv()
> > > > > > > {
> > > > > > >     BlockReplayState *s = bs->opaque;
> > > > > > >     int reqid = s->reqid++;
> > > > > > >
> > > > > > >     bdrv_co_readv(bs->file, ...);
> > > > > > >
> > > > > > >     if (mode == record) {
> > > > > > >         log(reqid, time);
> > > > > > >     } else {
> > > > > > >         assert(mode == replay);
> > > > > > >         bool *done = req_replayed_list_get(reqid)
> > > > > > >         if (done) {
> > > > > > >             *done = true;
> > > > > > >         } else {
> > > > > > point A
> > > > > > >             req_completed_list_insert(reqid, qemu_coroutine_self());
> > > > > > >             qemu_coroutine_yield();
> > > > > > >         }
> > > > > > >     }
> > > > > > > }
> > > > > > >
> > > > > > > /* called by replay.c */
> > > > > > > int blkreplay_run_event()
> > > > > > > {
> > > > > > >     if (mode == replay) {
> > > > > > >         co = req_completed_list_get(e.reqid);
> > > > > > >         if (co) {
> > > > > > >             qemu_coroutine_enter(co);
> > > > > > >         } else {
> > > > > > >             bool done = false;
> > > > > > >             req_replayed_list_insert(reqid, &done);
> > > > > > point B
> > > > > > >             /* wait synchronously for completion */
> > > > > > >             while (!done) {
> > > > > > >                 aio_poll();
> > > > > > >             }
> > > > > > >         }
> > > > > > >     }
> > > > > > > }
> > > > > >
> > > > > > One more question about coroutines.
> > > > > > Are race conditions possible in this sample?
> > > > > > In replay mode we may call readv, and reach point A.
> > > > > > On the same time, we will read point B in another thread.
> > > > > > Then readv will yield and nobody will start it back?
> > > > >
> > > > > There are two aspects to this:
> > > > >
> > > > > * Real multithreading doesn't exist in the block layer. All block driver
> > > > >   functions are only called with the mutex in the AioContext held. There
> > > > >   is exactly one AioContext per BDS, so no two threads can possible be
> > > > >   operating on the same BDS at the same time.
> > > > >
> > > > > * Coroutines are different from threads in that they aren't preemptive.
> > > > >   They are only interrupted in places where they explicitly yield.
> > > > >
> > > > > Of course, in order for this to work, we actually need to take the mutex
> > > > > before calling blkreplay_run_event(), which is called directly from the
> > > > > replay code (which runs in the mainloop thread? Or vcpu?).
> > > >
> > > > blkreplay_run_event() is called from replay code which is protected by mutex.
> > > > This function may be called from io and vcpu threads, because both of them
> > > > have replay functions invocations.
> > >
> > > Now I've encountered a situation where blkreplay_run_event is called from read coroutine:
> > > bdrv_prwv_co -> aio_poll -> qemu_clock_get_ns -> replay_read_clock -> blkreplay_run_event
> > >            \--> bdrv_co_readv -> blkreplay_co_readv -> bdrv_co_readv(lower layer)
> > >
> > > bdrv_co_readv inside blkreplay_co_readv can't proceed in this situation.
> > > This is probably because aio_poll has taken the aio context?
> > > How can I resolve this?
> > 
> > First of all, I'm not sure if running replay events from
> > qemu_clock_get_ns() is such a great idea. This is not a function that
> > callers expect to change any state. If you absolutely have to do it
> > there instead of in the clock device emulations, maybe restricting it to
> > replaying clock events could make it a bit more harmless.
> 
> Only virtual clock is emulated, and host clock is read from the host
> real time sources and therefore has to be saved into the log.

Isn't the host clock invisible to the guest anyway?

> There could be asynchronous events that occur in non-cpu threads.
> For now these events are shutdown request and block task execution.
> They may "hide" following clock (or another one) events. That is why
> we process them until synchronous event (like clock, instructions 
> execution, or checkpoint) is met.
> 
> 
> > Anyway, what does "can't proceed" mean? The coroutine yields because
> > it's waiting for I/O, but it is never reentered? Or is it hanging while
> > trying to acquire a lock?
> 
> I've solved this problem by slightly modifying the queue.
> I haven't yet made BlockDriverState assignment to the request ids.
> Therefore aio_poll was temporarily replaced with usleep.
> Now execution starts and hangs at some random moment of OS loading.
> 
> Here is the current version of blkreplay functions:
> 
> static int coroutine_fn blkreplay_co_readv(BlockDriverState *bs,
>     int64_t sector_num, int nb_sectors, QEMUIOVector *qiov)
> {
>     uint32_t reqid = request_id++;
>     Request *req;
>     req = block_request_insert(reqid, bs, qemu_coroutine_self());
>     bdrv_co_readv(bs->file->bs, sector_num, nb_sectors, qiov);
> 
>     if (replay_mode == REPLAY_MODE_RECORD) {
>         replay_save_block_event(reqid);
>     } else {
>         assert(replay_mode == REPLAY_MODE_PLAY);
>         qemu_coroutine_yield();
>     }
>     block_request_remove(req);
>     
>     return 0;
> }
> 
> void replay_run_block_event(uint32_t id)
> {
>     Request *req;
>     if (replay_mode == REPLAY_MODE_PLAY) {
>         while (!(req = block_request_find(id))) {
>             //aio_poll(bdrv_get_aio_context(req->bs), true);
>             usleep(1);
>         }

How is this loop supposed to make any progress?

And I still don't understand why aio_poll() doesn't work and where it
hangs.

Kevin

>         qemu_coroutine_enter(req->co, NULL);
>     }
> }
> 
> > Can you provide more detail about the exact place where it's hanging,
> > both in the coroutine and in the main "coroutine" that executes
> > aio_poll()?
> 
> In this version replay_run_block_event() executes while loop.
> I haven't found what other threads do, because the debugger doesn't show me
> call stack when thread is waiting in some blocking function.
> 
> Pavel Dovgalyuk
>
Pavel Dovgalyuk Feb. 15, 2016, 2:24 p.m. UTC | #27
> From: Kevin Wolf [mailto:kwolf@redhat.com]
> > >
> > > First of all, I'm not sure if running replay events from
> > > qemu_clock_get_ns() is such a great idea. This is not a function that
> > > callers expect to change any state. If you absolutely have to do it
> > > there instead of in the clock device emulations, maybe restricting it to
> > > replaying clock events could make it a bit more harmless.
> >
> > Only virtual clock is emulated, and host clock is read from the host
> > real time sources and therefore has to be saved into the log.
> 
> Isn't the host clock invisible to the guest anyway?

It isn't. Host clock is used by guest RTC.

> > There could be asynchronous events that occur in non-cpu threads.
> > For now these events are shutdown request and block task execution.
> > They may "hide" following clock (or another one) events. That is why
> > we process them until synchronous event (like clock, instructions
> > execution, or checkpoint) is met.
> >
> >
> > > Anyway, what does "can't proceed" mean? The coroutine yields because
> > > it's waiting for I/O, but it is never reentered? Or is it hanging while
> > > trying to acquire a lock?
> >
> > I've solved this problem by slightly modifying the queue.
> > I haven't yet made BlockDriverState assignment to the request ids.
> > Therefore aio_poll was temporarily replaced with usleep.
> > Now execution starts and hangs at some random moment of OS loading.
> >
> > Here is the current version of blkreplay functions:
> >
> > static int coroutine_fn blkreplay_co_readv(BlockDriverState *bs,
> >     int64_t sector_num, int nb_sectors, QEMUIOVector *qiov)
> > {
> >     uint32_t reqid = request_id++;
> >     Request *req;
> >     req = block_request_insert(reqid, bs, qemu_coroutine_self());
> >     bdrv_co_readv(bs->file->bs, sector_num, nb_sectors, qiov);
> >
> >     if (replay_mode == REPLAY_MODE_RECORD) {
> >         replay_save_block_event(reqid);
> >     } else {
> >         assert(replay_mode == REPLAY_MODE_PLAY);
> >         qemu_coroutine_yield();
> >     }
> >     block_request_remove(req);
> >
> >     return 0;
> > }
> >
> > void replay_run_block_event(uint32_t id)
> > {
> >     Request *req;
> >     if (replay_mode == REPLAY_MODE_PLAY) {
> >         while (!(req = block_request_find(id))) {
> >             //aio_poll(bdrv_get_aio_context(req->bs), true);
> >             usleep(1);
> >         }
> 
> How is this loop supposed to make any progress?

This loop does not supposed to make any progress. It waits until block_request_insert
call is added to the queue.

> And I still don't understand why aio_poll() doesn't work and where it
> hangs.

aio_poll hangs if "req = block_request_insert(reqid, bs, qemu_coroutine_self());" line
is executed after bdrv_co_readv. When bdrv_co_readv yields, replay_run_block_event has no
information about pending request and cannot jump to its coroutine.
Maybe I should implement aio_poll execution there to make progress in that case?

> >         qemu_coroutine_enter(req->co, NULL);
> >     }
> > }


Pavel Dovgalyuk
Pavel Dovgalyuk Feb. 15, 2016, 2:50 p.m. UTC | #28
> From: Kevin Wolf [mailto:kwolf@redhat.com]
> > void replay_run_block_event(uint32_t id)
> > {
> >     Request *req;
> >     if (replay_mode == REPLAY_MODE_PLAY) {
> >         while (!(req = block_request_find(id))) {
> >             //aio_poll(bdrv_get_aio_context(req->bs), true);
> >             usleep(1);
> >         }
> 
> >         qemu_coroutine_enter(req->co, NULL);
> >     }
> > }
> >
> > > Can you provide more detail about the exact place where it's hanging,
> > > both in the coroutine and in the main "coroutine" that executes
> > > aio_poll()?

I've tried to replace usleep with aio_poll.
In this case replay cannot be made, because of recursive mutex lock:
aio_poll -> qemu_clock_get_ns -> <lock replay mutex> -> 
  replay_run_block_event -> aio_poll -> qemu_clock_get_ns -> <lock replay mutex> ->
  <assert failed>

Pavel Dovgalyuk
Kevin Wolf Feb. 15, 2016, 3:01 p.m. UTC | #29
Am 15.02.2016 um 15:24 hat Pavel Dovgalyuk geschrieben:
> > From: Kevin Wolf [mailto:kwolf@redhat.com]
> > > >
> > > > First of all, I'm not sure if running replay events from
> > > > qemu_clock_get_ns() is such a great idea. This is not a function that
> > > > callers expect to change any state. If you absolutely have to do it
> > > > there instead of in the clock device emulations, maybe restricting it to
> > > > replaying clock events could make it a bit more harmless.
> > >
> > > Only virtual clock is emulated, and host clock is read from the host
> > > real time sources and therefore has to be saved into the log.
> > 
> > Isn't the host clock invisible to the guest anyway?
> 
> It isn't. Host clock is used by guest RTC.
> 
> > > There could be asynchronous events that occur in non-cpu threads.
> > > For now these events are shutdown request and block task execution.
> > > They may "hide" following clock (or another one) events. That is why
> > > we process them until synchronous event (like clock, instructions
> > > execution, or checkpoint) is met.
> > >
> > >
> > > > Anyway, what does "can't proceed" mean? The coroutine yields because
> > > > it's waiting for I/O, but it is never reentered? Or is it hanging while
> > > > trying to acquire a lock?
> > >
> > > I've solved this problem by slightly modifying the queue.
> > > I haven't yet made BlockDriverState assignment to the request ids.
> > > Therefore aio_poll was temporarily replaced with usleep.
> > > Now execution starts and hangs at some random moment of OS loading.
> > >
> > > Here is the current version of blkreplay functions:
> > >
> > > static int coroutine_fn blkreplay_co_readv(BlockDriverState *bs,
> > >     int64_t sector_num, int nb_sectors, QEMUIOVector *qiov)
> > > {
> > >     uint32_t reqid = request_id++;
> > >     Request *req;
> > >     req = block_request_insert(reqid, bs, qemu_coroutine_self());
> > >     bdrv_co_readv(bs->file->bs, sector_num, nb_sectors, qiov);
> > >
> > >     if (replay_mode == REPLAY_MODE_RECORD) {
> > >         replay_save_block_event(reqid);
> > >     } else {
> > >         assert(replay_mode == REPLAY_MODE_PLAY);
> > >         qemu_coroutine_yield();
> > >     }
> > >     block_request_remove(req);
> > >
> > >     return 0;
> > > }
> > >
> > > void replay_run_block_event(uint32_t id)
> > > {
> > >     Request *req;
> > >     if (replay_mode == REPLAY_MODE_PLAY) {
> > >         while (!(req = block_request_find(id))) {
> > >             //aio_poll(bdrv_get_aio_context(req->bs), true);
> > >             usleep(1);
> > >         }
> > 
> > How is this loop supposed to make any progress?
> 
> This loop does not supposed to make any progress. It waits until block_request_insert
> call is added to the queue.

Yes. And who is supposed to add something to the queue? You are looping
and doing nothing, so no other code gets to run in this thread. It's
only aio_poll() that would run event handlers that could eventually add
something to the list.

> > And I still don't understand why aio_poll() doesn't work and where it
> > hangs.
> 
> aio_poll hangs if "req = block_request_insert(reqid, bs, qemu_coroutine_self());" line
> is executed after bdrv_co_readv. When bdrv_co_readv yields, replay_run_block_event has no
> information about pending request and cannot jump to its coroutine.
> Maybe I should implement aio_poll execution there to make progress in that case?

Up in the thread, I posted code that was a bit more complex than what
you have, and which considered both cases (completion before
replay/completion after replay). If you implement it like this, it might
just work. And yes, it involved an aio_poll() loop in the replay
function.

Kevin
Pavel Dovgalyuk Feb. 16, 2016, 6:25 a.m. UTC | #30
> From: Kevin Wolf [mailto:kwolf@redhat.com]
> Am 15.02.2016 um 15:24 hat Pavel Dovgalyuk geschrieben:
> > > From: Kevin Wolf [mailto:kwolf@redhat.com]
> >
> > > > There could be asynchronous events that occur in non-cpu threads.
> > > > For now these events are shutdown request and block task execution.
> > > > They may "hide" following clock (or another one) events. That is why
> > > > we process them until synchronous event (like clock, instructions
> > > > execution, or checkpoint) is met.
> > > >
> > > >
> > > > > Anyway, what does "can't proceed" mean? The coroutine yields because
> > > > > it's waiting for I/O, but it is never reentered? Or is it hanging while
> > > > > trying to acquire a lock?
> > > >
> > > > I've solved this problem by slightly modifying the queue.
> > > > I haven't yet made BlockDriverState assignment to the request ids.
> > > > Therefore aio_poll was temporarily replaced with usleep.
> > > > Now execution starts and hangs at some random moment of OS loading.
> > > >
> > > > Here is the current version of blkreplay functions:
> > > >
> > > > static int coroutine_fn blkreplay_co_readv(BlockDriverState *bs,
> > > >     int64_t sector_num, int nb_sectors, QEMUIOVector *qiov)
> > > > {
> > > >     uint32_t reqid = request_id++;
> > > >     Request *req;
> > > >     req = block_request_insert(reqid, bs, qemu_coroutine_self());
> > > >     bdrv_co_readv(bs->file->bs, sector_num, nb_sectors, qiov);
> > > >
> > > >     if (replay_mode == REPLAY_MODE_RECORD) {
> > > >         replay_save_block_event(reqid);
> > > >     } else {
> > > >         assert(replay_mode == REPLAY_MODE_PLAY);
> > > >         qemu_coroutine_yield();
> > > >     }
> > > >     block_request_remove(req);
> > > >
> > > >     return 0;
> > > > }
> > > >
> > > > void replay_run_block_event(uint32_t id)
> > > > {
> > > >     Request *req;
> > > >     if (replay_mode == REPLAY_MODE_PLAY) {
> > > >         while (!(req = block_request_find(id))) {
> > > >             //aio_poll(bdrv_get_aio_context(req->bs), true);
> > > >             usleep(1);
> > > >         }
> > >
> > > How is this loop supposed to make any progress?
> >
> > This loop does not supposed to make any progress. It waits until block_request_insert
> > call is added to the queue.
> 
> Yes. And who is supposed to add something to the queue? You are looping
> and doing nothing, so no other code gets to run in this thread. It's
> only aio_poll() that would run event handlers that could eventually add
> something to the list.

blkreplay_co_readv adds request to the queue.

> 
> > > And I still don't understand why aio_poll() doesn't work and where it
> > > hangs.
> >
> > aio_poll hangs if "req = block_request_insert(reqid, bs, qemu_coroutine_self());" line
> > is executed after bdrv_co_readv. When bdrv_co_readv yields, replay_run_block_event has no
> > information about pending request and cannot jump to its coroutine.
> > Maybe I should implement aio_poll execution there to make progress in that case?
> 
> Up in the thread, I posted code that was a bit more complex than what
> you have, and which considered both cases (completion before
> replay/completion after replay). If you implement it like this, it might
> just work. And yes, it involved an aio_poll() loop in the replay
> function.

You are right, but I found kind of fundamental problem here.
There are two possible approaches for replaying coroutine events:

1. Waiting with aio_poll in replay_run_block_event.
   In this case replay cannot be made, because of recursive mutex lock:
   aio_poll -> qemu_clock_get_ns -> <lock replay mutex> -> 
     replay_run_block_event -> aio_poll -> qemu_clock_get_ns -> <lock replay mutex> ->
     <assert failed>
   I.e. we have recursive aio_poll function calls that lead to recursive replay calls
   and to recursive mutex lock.

2. Adding block events to the queue until checkpoint is met.
   This will not work with synchronous requests, because they will wait infinitely
   and checkpoint will never happen. Correct me, if I'm not right for this case.
   If blkreplay_co_readv will yield, can vcpu thread unlock the BQL and wait for 
   the checkpoint in iothread?

Pavel Dovgalyuk
Kevin Wolf Feb. 16, 2016, 10:02 a.m. UTC | #31
Am 16.02.2016 um 07:25 hat Pavel Dovgalyuk geschrieben:
> > From: Kevin Wolf [mailto:kwolf@redhat.com]
> > Am 15.02.2016 um 15:24 hat Pavel Dovgalyuk geschrieben:
> > > > From: Kevin Wolf [mailto:kwolf@redhat.com]
> > >
> > > > > There could be asynchronous events that occur in non-cpu threads.
> > > > > For now these events are shutdown request and block task execution.
> > > > > They may "hide" following clock (or another one) events. That is why
> > > > > we process them until synchronous event (like clock, instructions
> > > > > execution, or checkpoint) is met.
> > > > >
> > > > >
> > > > > > Anyway, what does "can't proceed" mean? The coroutine yields because
> > > > > > it's waiting for I/O, but it is never reentered? Or is it hanging while
> > > > > > trying to acquire a lock?
> > > > >
> > > > > I've solved this problem by slightly modifying the queue.
> > > > > I haven't yet made BlockDriverState assignment to the request ids.
> > > > > Therefore aio_poll was temporarily replaced with usleep.
> > > > > Now execution starts and hangs at some random moment of OS loading.
> > > > >
> > > > > Here is the current version of blkreplay functions:
> > > > >
> > > > > static int coroutine_fn blkreplay_co_readv(BlockDriverState *bs,
> > > > >     int64_t sector_num, int nb_sectors, QEMUIOVector *qiov)
> > > > > {
> > > > >     uint32_t reqid = request_id++;
> > > > >     Request *req;
> > > > >     req = block_request_insert(reqid, bs, qemu_coroutine_self());
> > > > >     bdrv_co_readv(bs->file->bs, sector_num, nb_sectors, qiov);
> > > > >
> > > > >     if (replay_mode == REPLAY_MODE_RECORD) {
> > > > >         replay_save_block_event(reqid);
> > > > >     } else {
> > > > >         assert(replay_mode == REPLAY_MODE_PLAY);
> > > > >         qemu_coroutine_yield();
> > > > >     }
> > > > >     block_request_remove(req);
> > > > >
> > > > >     return 0;
> > > > > }
> > > > >
> > > > > void replay_run_block_event(uint32_t id)
> > > > > {
> > > > >     Request *req;
> > > > >     if (replay_mode == REPLAY_MODE_PLAY) {
> > > > >         while (!(req = block_request_find(id))) {
> > > > >             //aio_poll(bdrv_get_aio_context(req->bs), true);
> > > > >             usleep(1);
> > > > >         }
> > > >
> > > > How is this loop supposed to make any progress?
> > >
> > > This loop does not supposed to make any progress. It waits until block_request_insert
> > > call is added to the queue.
> > 
> > Yes. And who is supposed to add something to the queue? You are looping
> > and doing nothing, so no other code gets to run in this thread. It's
> > only aio_poll() that would run event handlers that could eventually add
> > something to the list.
> 
> blkreplay_co_readv adds request to the queue.

I don't see blkreplay_co_readv() called in this loop without aio_poll().

> > > > And I still don't understand why aio_poll() doesn't work and where it
> > > > hangs.
> > >
> > > aio_poll hangs if "req = block_request_insert(reqid, bs, qemu_coroutine_self());" line
> > > is executed after bdrv_co_readv. When bdrv_co_readv yields, replay_run_block_event has no
> > > information about pending request and cannot jump to its coroutine.
> > > Maybe I should implement aio_poll execution there to make progress in that case?
> > 
> > Up in the thread, I posted code that was a bit more complex than what
> > you have, and which considered both cases (completion before
> > replay/completion after replay). If you implement it like this, it might
> > just work. And yes, it involved an aio_poll() loop in the replay
> > function.
> 
> You are right, but I found kind of fundamental problem here.
> There are two possible approaches for replaying coroutine events:
> 
> 1. Waiting with aio_poll in replay_run_block_event.
>    In this case replay cannot be made, because of recursive mutex lock:
>    aio_poll -> qemu_clock_get_ns -> <lock replay mutex> -> 
>      replay_run_block_event -> aio_poll -> qemu_clock_get_ns -> <lock replay mutex> ->
>      <assert failed>
>    I.e. we have recursive aio_poll function calls that lead to recursive replay calls
>    and to recursive mutex lock.

That's what you get for replaying stuff in qemu_clock_get_ns() when it
should really be replayed in the hardware emulation.

This is a problem you would also have had with your original patch, as
there are some places in the block layer that use aio_poll() internally.
Using the blkreplay driver only made you find the bug earlier.

But... can you detail how we get from aio_poll() to qemu_clock_get_ns()?
I don't see it directly calling that function, so it might just be a
callback to some guest device that is running now.

> 2. Adding block events to the queue until checkpoint is met.
>    This will not work with synchronous requests, because they will wait infinitely
>    and checkpoint will never happen. Correct me, if I'm not right for this case.
>    If blkreplay_co_readv will yield, can vcpu thread unlock the BQL and wait for 
>    the checkpoint in iothread?

I don't know enough checkpoints and about CPU emulation to give a
definite answer. This kind of blocking the I/O thread feels wrong,
though.

Kevin
Pavel Dovgalyuk Feb. 16, 2016, 11:20 a.m. UTC | #32
> From: Kevin Wolf [mailto:kwolf@redhat.com]
> Am 16.02.2016 um 07:25 hat Pavel Dovgalyuk geschrieben:
> > > From: Kevin Wolf [mailto:kwolf@redhat.com]
> > > Am 15.02.2016 um 15:24 hat Pavel Dovgalyuk geschrieben:
> > > > > From: Kevin Wolf [mailto:kwolf@redhat.com]
> > > > > > Here is the current version of blkreplay functions:
> > > > > >
> > > > > > static int coroutine_fn blkreplay_co_readv(BlockDriverState *bs,
> > > > > >     int64_t sector_num, int nb_sectors, QEMUIOVector *qiov)
> > > > > > {
> > > > > >     uint32_t reqid = request_id++;
> > > > > >     Request *req;
> > > > > >     req = block_request_insert(reqid, bs, qemu_coroutine_self());
> > > > > >     bdrv_co_readv(bs->file->bs, sector_num, nb_sectors, qiov);
> > > > > >
> > > > > >     if (replay_mode == REPLAY_MODE_RECORD) {
> > > > > >         replay_save_block_event(reqid);
> > > > > >     } else {
> > > > > >         assert(replay_mode == REPLAY_MODE_PLAY);
> > > > > >         qemu_coroutine_yield();
> > > > > >     }
> > > > > >     block_request_remove(req);
> > > > > >
> > > > > >     return 0;
> > > > > > }
> > > > > >
> > > > > > void replay_run_block_event(uint32_t id)
> > > > > > {
> > > > > >     Request *req;
> > > > > >     if (replay_mode == REPLAY_MODE_PLAY) {
> > > > > >         while (!(req = block_request_find(id))) {
> > > > > >             //aio_poll(bdrv_get_aio_context(req->bs), true);
> > > > > >             usleep(1);
> > > > > >         }
> > > > >
> > > > > How is this loop supposed to make any progress?
> > > >
> > > > This loop does not supposed to make any progress. It waits until block_request_insert
> > > > call is added to the queue.
> > >
> > > Yes. And who is supposed to add something to the queue? You are looping
> > > and doing nothing, so no other code gets to run in this thread. It's
> > > only aio_poll() that would run event handlers that could eventually add
> > > something to the list.
> >
> > blkreplay_co_readv adds request to the queue.
> 
> I don't see blkreplay_co_readv() called in this loop without aio_poll().

Is this mandatory?
I thought that version which you proposed has a race condition at the following lines
when AIO request is operated by a worker thread. Is this correct?

Coroutine                                                         Replay
bool *done = req_replayed_list_get(reqid) // NULL
                                                                  co =
req_completed_list_get(e.reqid); // NULL
req_completed_list_insert(reqid, qemu_coroutine_self());
qemu_coroutine_yield();
                                                                  bool done = false;
                                                                  req_replayed_list_insert(reqid,
&done);
                                                                  while (!done) { // nobody will
change done anymore
                                                                       aio_poll();
                                                                  }


> > You are right, but I found kind of fundamental problem here.
> > There are two possible approaches for replaying coroutine events:
> >
> > 1. Waiting with aio_poll in replay_run_block_event.
> >    In this case replay cannot be made, because of recursive mutex lock:
> >    aio_poll -> qemu_clock_get_ns -> <lock replay mutex> ->
> >      replay_run_block_event -> aio_poll -> qemu_clock_get_ns -> <lock replay mutex> ->
> >      <assert failed>
> >    I.e. we have recursive aio_poll function calls that lead to recursive replay calls
> >    and to recursive mutex lock.
> 
> That's what you get for replaying stuff in qemu_clock_get_ns() when it
> should really be replayed in the hardware emulation.

I'm not sure that it is possible for RTC. Host clock should run even if VM is stopped.
If we emulate host clock, its behavior should be somehow synchronized with the host.

> This is a problem you would also have had with your original patch, as
> there are some places in the block layer that use aio_poll() internally.
> Using the blkreplay driver only made you find the bug earlier.

Not exactly. Host clock replay will use cached clock value if there is no such event in the log.
The problem I observe here is with asynchronous block events that should be processed
immediately as they read from log.

> But... can you detail how we get from aio_poll() to qemu_clock_get_ns()?
> I don't see it directly calling that function, so it might just be a
> callback to some guest device that is running now.

aio_poll (win32) has the following line:

        timeout = blocking && !have_select_revents
            ? qemu_timeout_ns_to_ms(aio_compute_timeout(ctx)) : 0;

POSIX version also uses that function. aio_compute_timeout() requests time from
all existing clocks by calling timerlistgroup_deadline_ns().


Pavel Dovgalyuk
Kevin Wolf Feb. 16, 2016, 12:54 p.m. UTC | #33
Am 16.02.2016 um 12:20 hat Pavel Dovgalyuk geschrieben:
> > From: Kevin Wolf [mailto:kwolf@redhat.com]
> > Am 16.02.2016 um 07:25 hat Pavel Dovgalyuk geschrieben:
> > > > From: Kevin Wolf [mailto:kwolf@redhat.com]
> > > > Am 15.02.2016 um 15:24 hat Pavel Dovgalyuk geschrieben:
> > > > > > From: Kevin Wolf [mailto:kwolf@redhat.com]
> > > > > > > Here is the current version of blkreplay functions:
> > > > > > >
> > > > > > > static int coroutine_fn blkreplay_co_readv(BlockDriverState *bs,
> > > > > > >     int64_t sector_num, int nb_sectors, QEMUIOVector *qiov)
> > > > > > > {
> > > > > > >     uint32_t reqid = request_id++;
> > > > > > >     Request *req;
> > > > > > >     req = block_request_insert(reqid, bs, qemu_coroutine_self());
> > > > > > >     bdrv_co_readv(bs->file->bs, sector_num, nb_sectors, qiov);
> > > > > > >
> > > > > > >     if (replay_mode == REPLAY_MODE_RECORD) {
> > > > > > >         replay_save_block_event(reqid);
> > > > > > >     } else {
> > > > > > >         assert(replay_mode == REPLAY_MODE_PLAY);
> > > > > > >         qemu_coroutine_yield();
> > > > > > >     }
> > > > > > >     block_request_remove(req);
> > > > > > >
> > > > > > >     return 0;
> > > > > > > }
> > > > > > >
> > > > > > > void replay_run_block_event(uint32_t id)
> > > > > > > {
> > > > > > >     Request *req;
> > > > > > >     if (replay_mode == REPLAY_MODE_PLAY) {
> > > > > > >         while (!(req = block_request_find(id))) {
> > > > > > >             //aio_poll(bdrv_get_aio_context(req->bs), true);
> > > > > > >             usleep(1);
> > > > > > >         }
> > > > > >
> > > > > > How is this loop supposed to make any progress?
> > > > >
> > > > > This loop does not supposed to make any progress. It waits until block_request_insert
> > > > > call is added to the queue.
> > > >
> > > > Yes. And who is supposed to add something to the queue? You are looping
> > > > and doing nothing, so no other code gets to run in this thread. It's
> > > > only aio_poll() that would run event handlers that could eventually add
> > > > something to the list.
> > >
> > > blkreplay_co_readv adds request to the queue.
> > 
> > I don't see blkreplay_co_readv() called in this loop without aio_poll().
> 
> Is this mandatory?

If you don't want it to be an endless loop, someone needs to make the
value of the loop condition change...

> I thought that version which you proposed has a race condition at the following lines
> when AIO request is operated by a worker thread. Is this correct?

Unfortunately this ended up linewrapped, trying to restore what you
meant:

> Coroutine                                                         Replay
> bool *done = req_replayed_list_get(reqid) // NULL
>                                                                   co = req_completed_list_get(e.reqid); // NULL

There was no yield, this context switch is impossible to happen. Same
for the switch back.

> req_completed_list_insert(reqid, qemu_coroutine_self());
> qemu_coroutine_yield();

This is the point at which a context switch happens. The only other
point in my code is the qemu_coroutine_enter() in the other function.

>                                                                   bool done = false;
>                                                                   req_replayed_list_insert(reqid, &done);
>                                                                   while (!done) { // nobody will change done anymore
>                                                                        aio_poll();
>                                                                   }
> 
> 
> > > You are right, but I found kind of fundamental problem here.
> > > There are two possible approaches for replaying coroutine events:
> > >
> > > 1. Waiting with aio_poll in replay_run_block_event.
> > >    In this case replay cannot be made, because of recursive mutex lock:
> > >    aio_poll -> qemu_clock_get_ns -> <lock replay mutex> ->
> > >      replay_run_block_event -> aio_poll -> qemu_clock_get_ns -> <lock replay mutex> ->
> > >      <assert failed>
> > >    I.e. we have recursive aio_poll function calls that lead to recursive replay calls
> > >    and to recursive mutex lock.
> > 
> > That's what you get for replaying stuff in qemu_clock_get_ns() when it
> > should really be replayed in the hardware emulation.
> 
> I'm not sure that it is possible for RTC. Host clock should run even if VM is stopped.
> If we emulate host clock, its behavior should be somehow synchronized with the host.

I think the RTC behaviour should be replayed, whereas the host clock
should be the right one.

Or maybe it would actually make most sense to forbid clock=host for
record/replay, because in the replay you don't get the host clock
anyway, but just saved values.

> > This is a problem you would also have had with your original patch, as
> > there are some places in the block layer that use aio_poll() internally.
> > Using the blkreplay driver only made you find the bug earlier.
> 
> Not exactly. Host clock replay will use cached clock value if there is no such event in the log.
> The problem I observe here is with asynchronous block events that should be processed
> immediately as they read from log.

Doesn't your earlier patch do the same? I seem to remember that it calls
bdrv_drain_all(), which is essentially the same as calling aio_poll() in
a loop, except that its loop condition is slightly different.

> > But... can you detail how we get from aio_poll() to qemu_clock_get_ns()?
> > I don't see it directly calling that function, so it might just be a
> > callback to some guest device that is running now.
> 
> aio_poll (win32) has the following line:
> 
>         timeout = blocking && !have_select_revents
>             ? qemu_timeout_ns_to_ms(aio_compute_timeout(ctx)) : 0;
> 
> POSIX version also uses that function. aio_compute_timeout() requests time from
> all existing clocks by calling timerlistgroup_deadline_ns().

I see, thanks.

Kevin
Pavel Dovgalyuk Feb. 18, 2016, 9:18 a.m. UTC | #34
> From: Kevin Wolf [mailto:kwolf@redhat.com]
> Am 16.02.2016 um 12:20 hat Pavel Dovgalyuk geschrieben:
> > Coroutine                                                         Replay
> > bool *done = req_replayed_list_get(reqid) // NULL
> >                                                                   co =
> req_completed_list_get(e.reqid); // NULL
> 
> There was no yield, this context switch is impossible to happen. Same
> for the switch back.
> 
> > req_completed_list_insert(reqid, qemu_coroutine_self());
> > qemu_coroutine_yield();
> 
> This is the point at which a context switch happens. The only other
> point in my code is the qemu_coroutine_enter() in the other function.

I've fixed aio_poll problem by disabling mutex lock for the replay_run_block_event()
execution. Now virtual machine deterministically runs 4e8 instructions of Windows XP booting.

But then one non-deterministic event happens.
Callback after finishing coroutine may be called from different contexts.
apic_update_irq() function behaves differently being called from vcpu and io threads.
In one case it sets CPU_INTERRUPT_POLL and in other - nothing happens.

Therefore execution becomes non-deterministic.
In previous version of the patch I solved this problem by linking block events to the
execution checkpoints. IO thread have its own checkpoints and vcpu - its own.
Therefore apic callbacks are always called from the same thread in replay as in recording phase.

Pavel Dovgalyuk
Pavel Dovgalyuk Feb. 20, 2016, 7:11 a.m. UTC | #35
> From: Pavel Dovgalyuk [mailto:dovgaluk@ispras.ru]
> > From: Kevin Wolf [mailto:kwolf@redhat.com]
> > Am 16.02.2016 um 12:20 hat Pavel Dovgalyuk geschrieben:
> > > Coroutine                                                         Replay
> > > bool *done = req_replayed_list_get(reqid) // NULL
> > >                                                                   co =
> > req_completed_list_get(e.reqid); // NULL
> >
> > There was no yield, this context switch is impossible to happen. Same
> > for the switch back.
> >
> > > req_completed_list_insert(reqid, qemu_coroutine_self());
> > > qemu_coroutine_yield();
> >
> > This is the point at which a context switch happens. The only other
> > point in my code is the qemu_coroutine_enter() in the other function.
> 
> I've fixed aio_poll problem by disabling mutex lock for the replay_run_block_event()
> execution. Now virtual machine deterministically runs 4e8 instructions of Windows XP booting.
> 
> But then one non-deterministic event happens.
> Callback after finishing coroutine may be called from different contexts.
> apic_update_irq() function behaves differently being called from vcpu and io threads.
> In one case it sets CPU_INTERRUPT_POLL and in other - nothing happens.

Kevin, do you have some ideas how to fix this issue?
This happens because of coroutines may be assigned to different threads.
Maybe there is some way of making this assignment more deterministic?

> Therefore execution becomes non-deterministic.
> In previous version of the patch I solved this problem by linking block events to the
> execution checkpoints. IO thread have its own checkpoints and vcpu - its own.
> Therefore apic callbacks are always called from the same thread in replay as in recording
> phase.

Pavel Dovgalyuk
Kevin Wolf Feb. 22, 2016, 11:06 a.m. UTC | #36
Am 20.02.2016 um 08:11 hat Pavel Dovgalyuk geschrieben:
> > From: Pavel Dovgalyuk [mailto:dovgaluk@ispras.ru]
> > > From: Kevin Wolf [mailto:kwolf@redhat.com]
> > > Am 16.02.2016 um 12:20 hat Pavel Dovgalyuk geschrieben:
> > > > Coroutine                                                         Replay
> > > > bool *done = req_replayed_list_get(reqid) // NULL
> > > >                                                                   co =
> > > req_completed_list_get(e.reqid); // NULL
> > >
> > > There was no yield, this context switch is impossible to happen. Same
> > > for the switch back.
> > >
> > > > req_completed_list_insert(reqid, qemu_coroutine_self());
> > > > qemu_coroutine_yield();
> > >
> > > This is the point at which a context switch happens. The only other
> > > point in my code is the qemu_coroutine_enter() in the other function.
> > 
> > I've fixed aio_poll problem by disabling mutex lock for the replay_run_block_event()
> > execution. Now virtual machine deterministically runs 4e8 instructions of Windows XP booting.

Are you sure that the lock was unnecessary? Solving deadlocks by
removing the lock is a rather adventurous method.

I assume that you're still replaying events from low-level functions
like qemu_clock_get_ns(). So even if you get rid of the hangs, the
result is probably not quite right.

I'm afraid this is going in a direction where my comments can't be more
constructive than a simple "you're doing it wrong".

> > But then one non-deterministic event happens.
> > Callback after finishing coroutine may be called from different contexts.

How does this happen? I'm not aware of callbacks being processed by any
thread other than the I/O thread for that specific block device (unless
you use dataplane, this is the main loop thread).

> > apic_update_irq() function behaves differently being called from vcpu and io threads.
> > In one case it sets CPU_INTERRUPT_POLL and in other - nothing happens.
> 
> Kevin, do you have some ideas how to fix this issue?
> This happens because of coroutines may be assigned to different threads.
> Maybe there is some way of making this assignment more deterministic?

Coroutines aren't randomly assigned to threads, but threads actively
enter coroutines. To my knowledge this happens only when starting a
request (either vcpu or I/O thread; consistent per device) or by a
callback when some event happens (only I/O thread). I can't see any
non-determinism here.

Kevin

> > Therefore execution becomes non-deterministic.
> > In previous version of the patch I solved this problem by linking block events to the
> > execution checkpoints. IO thread have its own checkpoints and vcpu - its own.
> > Therefore apic callbacks are always called from the same thread in replay as in recording
> > phase.
> 
> Pavel Dovgalyuk
>
Pavel Dovgalyuk Feb. 24, 2016, 11:59 a.m. UTC | #37
> From: Kevin Wolf [mailto:kwolf@redhat.com]
> Am 20.02.2016 um 08:11 hat Pavel Dovgalyuk geschrieben:
> > > From: Pavel Dovgalyuk [mailto:dovgaluk@ispras.ru]
> > > > From: Kevin Wolf [mailto:kwolf@redhat.com]
> > > > Am 16.02.2016 um 12:20 hat Pavel Dovgalyuk geschrieben:
> > > > > Coroutine                                                         Replay
> > > > > bool *done = req_replayed_list_get(reqid) // NULL
> > > > >                                                                   co =
> > > > req_completed_list_get(e.reqid); // NULL
> > > >
> > > > There was no yield, this context switch is impossible to happen. Same
> > > > for the switch back.
> > > >
> > > > > req_completed_list_insert(reqid, qemu_coroutine_self());
> > > > > qemu_coroutine_yield();
> > > >
> > > > This is the point at which a context switch happens. The only other
> > > > point in my code is the qemu_coroutine_enter() in the other function.
> > >
> > > I've fixed aio_poll problem by disabling mutex lock for the replay_run_block_event()
> > > execution. Now virtual machine deterministically runs 4e8 instructions of Windows XP
> booting.
> > > But then one non-deterministic event happens.
> > > Callback after finishing coroutine may be called from different contexts.
> 
> How does this happen? I'm not aware of callbacks being processed by any
> thread other than the I/O thread for that specific block device (unless
> you use dataplane, this is the main loop thread).
> 
> > > apic_update_irq() function behaves differently being called from vcpu and io threads.
> > > In one case it sets CPU_INTERRUPT_POLL and in other - nothing happens.
> >
> > Kevin, do you have some ideas how to fix this issue?
> > This happens because of coroutines may be assigned to different threads.
> > Maybe there is some way of making this assignment more deterministic?
> 
> Coroutines aren't randomly assigned to threads, but threads actively
> enter coroutines. To my knowledge this happens only when starting a
> request (either vcpu or I/O thread; consistent per device) or by a
> callback when some event happens (only I/O thread). I can't see any
> non-determinism here.

Behavior of coroutines looks strange for me.
Consider the code below (co_readv function of the replay driver).
In record mode it somehow changes the thread it assigned to.
Code in point A is executed in CPU thread and code in point B - in some other thread.
May this happen because this coroutine yields somewhere and its execution is restored 
by aio_poll, which is called from iothread?
In this case event finishing callback cannot be executed deterministically
(always in CPU thread or always in IO thread).

static int coroutine_fn blkreplay_co_readv(BlockDriverState *bs,
    int64_t sector_num, int nb_sectors, QEMUIOVector *qiov)
{
    BDRVBlkreplayState *s = bs->opaque;
    uint32_t reqid = request_id++;
    Request *req;
// A
    bdrv_co_readv(bs->file->bs, sector_num, nb_sectors, qiov);

    if (replay_mode == REPLAY_MODE_RECORD) {
        replay_save_block_event(reqid);
    } else {
        assert(replay_mode == REPLAY_MODE_PLAY);
        if (reqid == current_request) {
            current_finished = true;
        } else {
            req = block_request_insert(reqid, bs, qemu_coroutine_self());
            qemu_coroutine_yield();
            block_request_remove(req);
        }
    }
// B
    return 0;
}

Pavel Dovgalyuk
Kevin Wolf Feb. 24, 2016, 1:14 p.m. UTC | #38
Am 24.02.2016 um 12:59 hat Pavel Dovgalyuk geschrieben:
> > From: Kevin Wolf [mailto:kwolf@redhat.com]
> > Am 20.02.2016 um 08:11 hat Pavel Dovgalyuk geschrieben:
> > > > From: Pavel Dovgalyuk [mailto:dovgaluk@ispras.ru]
> > > > > From: Kevin Wolf [mailto:kwolf@redhat.com]
> > > > > Am 16.02.2016 um 12:20 hat Pavel Dovgalyuk geschrieben:
> > > > > > Coroutine                                                         Replay
> > > > > > bool *done = req_replayed_list_get(reqid) // NULL
> > > > > >                                                                   co =
> > > > > req_completed_list_get(e.reqid); // NULL
> > > > >
> > > > > There was no yield, this context switch is impossible to happen. Same
> > > > > for the switch back.
> > > > >
> > > > > > req_completed_list_insert(reqid, qemu_coroutine_self());
> > > > > > qemu_coroutine_yield();
> > > > >
> > > > > This is the point at which a context switch happens. The only other
> > > > > point in my code is the qemu_coroutine_enter() in the other function.
> > > >
> > > > I've fixed aio_poll problem by disabling mutex lock for the replay_run_block_event()
> > > > execution. Now virtual machine deterministically runs 4e8 instructions of Windows XP
> > booting.
> > > > But then one non-deterministic event happens.
> > > > Callback after finishing coroutine may be called from different contexts.
> > 
> > How does this happen? I'm not aware of callbacks being processed by any
> > thread other than the I/O thread for that specific block device (unless
> > you use dataplane, this is the main loop thread).
> > 
> > > > apic_update_irq() function behaves differently being called from vcpu and io threads.
> > > > In one case it sets CPU_INTERRUPT_POLL and in other - nothing happens.
> > >
> > > Kevin, do you have some ideas how to fix this issue?
> > > This happens because of coroutines may be assigned to different threads.
> > > Maybe there is some way of making this assignment more deterministic?
> > 
> > Coroutines aren't randomly assigned to threads, but threads actively
> > enter coroutines. To my knowledge this happens only when starting a
> > request (either vcpu or I/O thread; consistent per device) or by a
> > callback when some event happens (only I/O thread). I can't see any
> > non-determinism here.
> 
> Behavior of coroutines looks strange for me.
> Consider the code below (co_readv function of the replay driver).
> In record mode it somehow changes the thread it assigned to.
> Code in point A is executed in CPU thread and code in point B - in some other thread.
> May this happen because this coroutine yields somewhere and its execution is restored 
> by aio_poll, which is called from iothread?
> In this case event finishing callback cannot be executed deterministically
> (always in CPU thread or always in IO thread).
> 
> static int coroutine_fn blkreplay_co_readv(BlockDriverState *bs,
>     int64_t sector_num, int nb_sectors, QEMUIOVector *qiov)
> {
>     BDRVBlkreplayState *s = bs->opaque;
>     uint32_t reqid = request_id++;
>     Request *req;
> // A
>     bdrv_co_readv(bs->file->bs, sector_num, nb_sectors, qiov);
> 
>     if (replay_mode == REPLAY_MODE_RECORD) {
>         replay_save_block_event(reqid);
>     } else {
>         assert(replay_mode == REPLAY_MODE_PLAY);
>         if (reqid == current_request) {
>             current_finished = true;
>         } else {
>             req = block_request_insert(reqid, bs, qemu_coroutine_self());
>             qemu_coroutine_yield();
>             block_request_remove(req);
>         }
>     }
> // B
>     return 0;
> }

Yes, I guess this can happen. As I described above, the coroutine can be
entered from a vcpu thread initially. After yielding for the first time,
it is resumed from the I/O thread. So if there are paths where the
coroutine never yields, the coroutine completes in the original vcpu
thread. (It's not the common case that bdrv_co_readv() doesn't yield,
but it happens e.g. with unallocated sectors in qcow2.)

If this is a problem for you, you need to force the coroutine into the
I/O thread. You can do that by scheduling a BH, then yield, and then let
the BH reenter the coroutine.

Kevin
Pavel Dovgalyuk Feb. 25, 2016, 9:06 a.m. UTC | #39
> From: Kevin Wolf [mailto:kwolf@redhat.com]
> > > Coroutines aren't randomly assigned to threads, but threads actively
> > > enter coroutines. To my knowledge this happens only when starting a
> > > request (either vcpu or I/O thread; consistent per device) or by a
> > > callback when some event happens (only I/O thread). I can't see any
> > > non-determinism here.
> >
> > Behavior of coroutines looks strange for me.
> > Consider the code below (co_readv function of the replay driver).
> > In record mode it somehow changes the thread it assigned to.
> > Code in point A is executed in CPU thread and code in point B - in some other thread.
> > May this happen because this coroutine yields somewhere and its execution is restored
> > by aio_poll, which is called from iothread?
> > In this case event finishing callback cannot be executed deterministically
> > (always in CPU thread or always in IO thread).
> >
> > static int coroutine_fn blkreplay_co_readv(BlockDriverState *bs,
> >     int64_t sector_num, int nb_sectors, QEMUIOVector *qiov)
> > {
> >     BDRVBlkreplayState *s = bs->opaque;
> >     uint32_t reqid = request_id++;
> >     Request *req;
> > // A
> >     bdrv_co_readv(bs->file->bs, sector_num, nb_sectors, qiov);
> >
> >     if (replay_mode == REPLAY_MODE_RECORD) {
> >         replay_save_block_event(reqid);
> >     } else {
> >         assert(replay_mode == REPLAY_MODE_PLAY);
> >         if (reqid == current_request) {
> >             current_finished = true;
> >         } else {
> >             req = block_request_insert(reqid, bs, qemu_coroutine_self());
> >             qemu_coroutine_yield();
> >             block_request_remove(req);
> >         }
> >     }
> > // B
> >     return 0;
> > }
> 
> Yes, I guess this can happen. As I described above, the coroutine can be
> entered from a vcpu thread initially. After yielding for the first time,
> it is resumed from the I/O thread. So if there are paths where the
> coroutine never yields, the coroutine completes in the original vcpu
> thread. (It's not the common case that bdrv_co_readv() doesn't yield,
> but it happens e.g. with unallocated sectors in qcow2.)
> 
> If this is a problem for you, you need to force the coroutine into the
> I/O thread. You can do that by scheduling a BH, then yield, and then let
> the BH reenter the coroutine.

Thanks, this approach seems to work. I got rid of replay_run_block_event,
because BH basically does the same job.

There is one problem with flush event - callbacks for flush are called for
all layers and I couldn't synchronize them correctly yet.
I'll probably have to add new callback to block driver, which handles
flush request for the whole stack of the drivers.

Pavel Dovgalyuk
Kevin Wolf Feb. 26, 2016, 9:01 a.m. UTC | #40
Am 25.02.2016 um 10:06 hat Pavel Dovgalyuk geschrieben:
> > From: Kevin Wolf [mailto:kwolf@redhat.com]
> > > > Coroutines aren't randomly assigned to threads, but threads actively
> > > > enter coroutines. To my knowledge this happens only when starting a
> > > > request (either vcpu or I/O thread; consistent per device) or by a
> > > > callback when some event happens (only I/O thread). I can't see any
> > > > non-determinism here.
> > >
> > > Behavior of coroutines looks strange for me.
> > > Consider the code below (co_readv function of the replay driver).
> > > In record mode it somehow changes the thread it assigned to.
> > > Code in point A is executed in CPU thread and code in point B - in some other thread.
> > > May this happen because this coroutine yields somewhere and its execution is restored
> > > by aio_poll, which is called from iothread?
> > > In this case event finishing callback cannot be executed deterministically
> > > (always in CPU thread or always in IO thread).
> > >
> > > static int coroutine_fn blkreplay_co_readv(BlockDriverState *bs,
> > >     int64_t sector_num, int nb_sectors, QEMUIOVector *qiov)
> > > {
> > >     BDRVBlkreplayState *s = bs->opaque;
> > >     uint32_t reqid = request_id++;
> > >     Request *req;
> > > // A
> > >     bdrv_co_readv(bs->file->bs, sector_num, nb_sectors, qiov);
> > >
> > >     if (replay_mode == REPLAY_MODE_RECORD) {
> > >         replay_save_block_event(reqid);
> > >     } else {
> > >         assert(replay_mode == REPLAY_MODE_PLAY);
> > >         if (reqid == current_request) {
> > >             current_finished = true;
> > >         } else {
> > >             req = block_request_insert(reqid, bs, qemu_coroutine_self());
> > >             qemu_coroutine_yield();
> > >             block_request_remove(req);
> > >         }
> > >     }
> > > // B
> > >     return 0;
> > > }
> > 
> > Yes, I guess this can happen. As I described above, the coroutine can be
> > entered from a vcpu thread initially. After yielding for the first time,
> > it is resumed from the I/O thread. So if there are paths where the
> > coroutine never yields, the coroutine completes in the original vcpu
> > thread. (It's not the common case that bdrv_co_readv() doesn't yield,
> > but it happens e.g. with unallocated sectors in qcow2.)
> > 
> > If this is a problem for you, you need to force the coroutine into the
> > I/O thread. You can do that by scheduling a BH, then yield, and then let
> > the BH reenter the coroutine.
> 
> Thanks, this approach seems to work. I got rid of replay_run_block_event,
> because BH basically does the same job.
> 
> There is one problem with flush event - callbacks for flush are called for
> all layers and I couldn't synchronize them correctly yet.
> I'll probably have to add new callback to block driver, which handles
> flush request for the whole stack of the drivers.

Flushes should be treated more or less the same a writes, I think.

Kevin
diff mbox

Patch

diff --git a/block/block-backend.c b/block/block-backend.c
index f41d326..d8820b9 100644
--- a/block/block-backend.c
+++ b/block/block-backend.c
@@ -16,6 +16,7 @@ 
 #include "block/throttle-groups.h"
 #include "sysemu/blockdev.h"
 #include "sysemu/sysemu.h"
+#include "sysemu/replay.h"
 #include "qapi-event.h"
 
 /* Number of coroutines to reserve per attached device model */
@@ -655,14 +656,14 @@  BlockAIOCB *blk_abort_aio_request(BlockBackend *blk,
 
     bh = aio_bh_new(blk_get_aio_context(blk), error_callback_bh, acb);
     acb->bh = bh;
-    qemu_bh_schedule(bh);
+    replay_bh_schedule_event(bh);
 
     return &acb->common;
 }
 
-BlockAIOCB *blk_aio_write_zeroes(BlockBackend *blk, int64_t sector_num,
-                                 int nb_sectors, BdrvRequestFlags flags,
-                                 BlockCompletionFunc *cb, void *opaque)
+BlockAIOCB *blk_aio_write_zeroes_impl(BlockBackend *blk, int64_t sector_num,
+                                      int nb_sectors, BdrvRequestFlags flags,
+                                      BlockCompletionFunc *cb, void *opaque)
 {
     int ret = blk_check_request(blk, sector_num, nb_sectors);
     if (ret < 0) {
@@ -673,6 +674,13 @@  BlockAIOCB *blk_aio_write_zeroes(BlockBackend *blk, int64_t sector_num,
                                  cb, opaque);
 }
 
+BlockAIOCB *blk_aio_write_zeroes(BlockBackend *blk, int64_t sector_num,
+                                 int nb_sectors, BdrvRequestFlags flags,
+                                 BlockCompletionFunc *cb, void *opaque)
+{
+    return replay_aio_write_zeroes(blk, sector_num, nb_sectors, flags, cb, opaque);
+}
+
 int blk_pread(BlockBackend *blk, int64_t offset, void *buf, int count)
 {
     int ret = blk_check_byte_request(blk, offset, count);
@@ -720,7 +728,7 @@  int64_t blk_nb_sectors(BlockBackend *blk)
     return bdrv_nb_sectors(blk->bs);
 }
 
-BlockAIOCB *blk_aio_readv(BlockBackend *blk, int64_t sector_num,
+BlockAIOCB *blk_aio_readv_impl(BlockBackend *blk, int64_t sector_num,
                           QEMUIOVector *iov, int nb_sectors,
                           BlockCompletionFunc *cb, void *opaque)
 {
@@ -732,9 +740,16 @@  BlockAIOCB *blk_aio_readv(BlockBackend *blk, int64_t sector_num,
     return bdrv_aio_readv(blk->bs, sector_num, iov, nb_sectors, cb, opaque);
 }
 
-BlockAIOCB *blk_aio_writev(BlockBackend *blk, int64_t sector_num,
-                           QEMUIOVector *iov, int nb_sectors,
-                           BlockCompletionFunc *cb, void *opaque)
+BlockAIOCB *blk_aio_readv(BlockBackend *blk, int64_t sector_num,
+                          QEMUIOVector *iov, int nb_sectors,
+                          BlockCompletionFunc *cb, void *opaque)
+{
+    return replay_aio_readv(blk, sector_num, iov, nb_sectors, cb, opaque);
+}
+
+BlockAIOCB *blk_aio_writev_impl(BlockBackend *blk, int64_t sector_num,
+                                QEMUIOVector *iov, int nb_sectors,
+                                BlockCompletionFunc *cb, void *opaque)
 {
     int ret = blk_check_request(blk, sector_num, nb_sectors);
     if (ret < 0) {
@@ -744,8 +759,15 @@  BlockAIOCB *blk_aio_writev(BlockBackend *blk, int64_t sector_num,
     return bdrv_aio_writev(blk->bs, sector_num, iov, nb_sectors, cb, opaque);
 }
 
-BlockAIOCB *blk_aio_flush(BlockBackend *blk,
-                          BlockCompletionFunc *cb, void *opaque)
+BlockAIOCB *blk_aio_writev(BlockBackend *blk, int64_t sector_num,
+                           QEMUIOVector *iov, int nb_sectors,
+                           BlockCompletionFunc *cb, void *opaque)
+{
+    return replay_aio_writev(blk, sector_num, iov, nb_sectors, cb, opaque);
+}
+
+BlockAIOCB *blk_aio_flush_impl(BlockBackend *blk,
+                               BlockCompletionFunc *cb, void *opaque)
 {
     if (!blk_is_available(blk)) {
         return blk_abort_aio_request(blk, cb, opaque, -ENOMEDIUM);
@@ -754,9 +776,15 @@  BlockAIOCB *blk_aio_flush(BlockBackend *blk,
     return bdrv_aio_flush(blk->bs, cb, opaque);
 }
 
-BlockAIOCB *blk_aio_discard(BlockBackend *blk,
-                            int64_t sector_num, int nb_sectors,
-                            BlockCompletionFunc *cb, void *opaque)
+BlockAIOCB *blk_aio_flush(BlockBackend *blk,
+                          BlockCompletionFunc *cb, void *opaque)
+{
+    return replay_aio_flush(blk, cb, opaque);
+}
+                          
+BlockAIOCB *blk_aio_discard_impl(BlockBackend *blk,
+                                 int64_t sector_num, int nb_sectors,
+                                 BlockCompletionFunc *cb, void *opaque)
 {
     int ret = blk_check_request(blk, sector_num, nb_sectors);
     if (ret < 0) {
@@ -766,6 +794,13 @@  BlockAIOCB *blk_aio_discard(BlockBackend *blk,
     return bdrv_aio_discard(blk->bs, sector_num, nb_sectors, cb, opaque);
 }
 
+BlockAIOCB *blk_aio_discard(BlockBackend *blk,
+                            int64_t sector_num, int nb_sectors,
+                            BlockCompletionFunc *cb, void *opaque)
+{
+    return replay_aio_discard(blk, sector_num, nb_sectors, cb, opaque);
+}
+
 void blk_aio_cancel(BlockAIOCB *acb)
 {
     bdrv_aio_cancel(acb);
@@ -799,8 +834,8 @@  int blk_ioctl(BlockBackend *blk, unsigned long int req, void *buf)
     return bdrv_ioctl(blk->bs, req, buf);
 }
 
-BlockAIOCB *blk_aio_ioctl(BlockBackend *blk, unsigned long int req, void *buf,
-                          BlockCompletionFunc *cb, void *opaque)
+BlockAIOCB *blk_aio_ioctl_impl(BlockBackend *blk, unsigned long int req, void *buf,
+                               BlockCompletionFunc *cb, void *opaque)
 {
     if (!blk_is_available(blk)) {
         return blk_abort_aio_request(blk, cb, opaque, -ENOMEDIUM);
@@ -809,6 +844,12 @@  BlockAIOCB *blk_aio_ioctl(BlockBackend *blk, unsigned long int req, void *buf,
     return bdrv_aio_ioctl(blk->bs, req, buf, cb, opaque);
 }
 
+BlockAIOCB *blk_aio_ioctl(BlockBackend *blk, unsigned long int req, void *buf,
+                          BlockCompletionFunc *cb, void *opaque)
+{
+    return replay_aio_ioctl(blk, req, buf, cb, opaque);
+}
+
 int blk_co_discard(BlockBackend *blk, int64_t sector_num, int nb_sectors)
 {
     int ret = blk_check_request(blk, sector_num, nb_sectors);
diff --git a/include/sysemu/block-backend.h b/include/sysemu/block-backend.h
index dc24476..cf05d56 100644
--- a/include/sysemu/block-backend.h
+++ b/include/sysemu/block-backend.h
@@ -100,6 +100,9 @@  int blk_write_zeroes(BlockBackend *blk, int64_t sector_num,
 BlockAIOCB *blk_aio_write_zeroes(BlockBackend *blk, int64_t sector_num,
                                  int nb_sectors, BdrvRequestFlags flags,
                                  BlockCompletionFunc *cb, void *opaque);
+BlockAIOCB *blk_aio_write_zeroes_impl(BlockBackend *blk, int64_t sector_num,
+                                      int nb_sectors, BdrvRequestFlags flags,
+                                      BlockCompletionFunc *cb, void *opaque);
 int blk_pread(BlockBackend *blk, int64_t offset, void *buf, int count);
 int blk_pwrite(BlockBackend *blk, int64_t offset, const void *buf, int count);
 int64_t blk_getlength(BlockBackend *blk);
@@ -108,20 +111,33 @@  int64_t blk_nb_sectors(BlockBackend *blk);
 BlockAIOCB *blk_aio_readv(BlockBackend *blk, int64_t sector_num,
                           QEMUIOVector *iov, int nb_sectors,
                           BlockCompletionFunc *cb, void *opaque);
+BlockAIOCB *blk_aio_readv_impl(BlockBackend *blk, int64_t sector_num,
+                          QEMUIOVector *iov, int nb_sectors,
+                          BlockCompletionFunc *cb, void *opaque);
 BlockAIOCB *blk_aio_writev(BlockBackend *blk, int64_t sector_num,
                            QEMUIOVector *iov, int nb_sectors,
                            BlockCompletionFunc *cb, void *opaque);
+BlockAIOCB *blk_aio_writev_impl(BlockBackend *blk, int64_t sector_num,
+                                QEMUIOVector *iov, int nb_sectors,
+                                BlockCompletionFunc *cb, void *opaque);
 BlockAIOCB *blk_aio_flush(BlockBackend *blk,
                           BlockCompletionFunc *cb, void *opaque);
+BlockAIOCB *blk_aio_flush_impl(BlockBackend *blk,
+                               BlockCompletionFunc *cb, void *opaque);
 BlockAIOCB *blk_aio_discard(BlockBackend *blk,
                             int64_t sector_num, int nb_sectors,
                             BlockCompletionFunc *cb, void *opaque);
+BlockAIOCB *blk_aio_discard_impl(BlockBackend *blk,
+                                 int64_t sector_num, int nb_sectors,
+                                 BlockCompletionFunc *cb, void *opaque);
 void blk_aio_cancel(BlockAIOCB *acb);
 void blk_aio_cancel_async(BlockAIOCB *acb);
 int blk_aio_multiwrite(BlockBackend *blk, BlockRequest *reqs, int num_reqs);
 int blk_ioctl(BlockBackend *blk, unsigned long int req, void *buf);
 BlockAIOCB *blk_aio_ioctl(BlockBackend *blk, unsigned long int req, void *buf,
                           BlockCompletionFunc *cb, void *opaque);
+BlockAIOCB *blk_aio_ioctl_impl(BlockBackend *blk, unsigned long int req, void *buf,
+                               BlockCompletionFunc *cb, void *opaque);
 int blk_co_discard(BlockBackend *blk, int64_t sector_num, int nb_sectors);
 int blk_co_flush(BlockBackend *blk);
 int blk_flush(BlockBackend *blk);
diff --git a/include/sysemu/replay.h b/include/sysemu/replay.h
index c879231..c06e355 100644
--- a/include/sysemu/replay.h
+++ b/include/sysemu/replay.h
@@ -17,6 +17,8 @@ 
 #include "qapi-types.h"
 #include "qapi/error.h"
 #include "qemu/typedefs.h"
+#include "block/aio.h"
+#include "block/block.h"
 
 /* replay clock kinds */
 enum ReplayClockKind {
@@ -125,6 +127,31 @@  void replay_register_char_driver(struct CharDriverState *chr);
 /*! Saves write to char device event to the log */
 void replay_chr_be_write(struct CharDriverState *s, uint8_t *buf, int len);
 
+/* Block devices */
+
+/*! Asynchronous read function. Adds read request to the queue. */
+BlockAIOCB *replay_aio_readv(BlockBackend *blk, int64_t sector_num,
+                             QEMUIOVector *iov, int nb_sectors,
+                             BlockCompletionFunc *cb, void *opaque);
+/*! Asynchronous write function. Adds write request to the queue. */
+BlockAIOCB *replay_aio_writev(BlockBackend *blk, int64_t sector_num,
+                              QEMUIOVector *iov, int nb_sectors,
+                              BlockCompletionFunc *cb, void *opaque);
+/*! Asynchronous write zeroes function. Adds this request to the queue. */
+BlockAIOCB *replay_aio_write_zeroes(BlockBackend *blk, int64_t sector_num,
+                                    int nb_sectors, BdrvRequestFlags flags,
+                                    BlockCompletionFunc *cb, void *opaque);
+/*! Asynchronous flush function. Adds flush request to the queue. */
+BlockAIOCB *replay_aio_flush(BlockBackend *blk,
+                             BlockCompletionFunc *cb, void *opaque);
+/*! Asynchronous discard function. Adds discard request to the queue. */
+BlockAIOCB *replay_aio_discard(BlockBackend *blk, int64_t sector_num,
+                               int nb_sectors, BlockCompletionFunc *cb,
+                               void *opaque);
+/*! Asynchronous ioctl function. Adds ioctl request to the queue. */
+BlockAIOCB *replay_aio_ioctl(BlockBackend *blk, unsigned long int req, void *buf,
+                             BlockCompletionFunc *cb, void *opaque);
+
 /* Other data */
 
 /*! Writes or reads integer value to/from replay log. */
diff --git a/replay/Makefile.objs b/replay/Makefile.objs
index 70e5572..050c0ea 100644
--- a/replay/Makefile.objs
+++ b/replay/Makefile.objs
@@ -4,3 +4,4 @@  common-obj-y += replay-events.o
 common-obj-y += replay-time.o
 common-obj-y += replay-input.o
 common-obj-y += replay-char.o
+common-obj-y += replay-block.o
diff --git a/replay/replay-block.c b/replay/replay-block.c
new file mode 100755
index 0000000..7fc68cf
--- /dev/null
+++ b/replay/replay-block.c
@@ -0,0 +1,202 @@ 
+/*
+ * replay-block.c
+ *
+ * Copyright (c) 2010-2016 Institute for System Programming
+ *                         of the Russian Academy of Sciences.
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ *
+ */
+#include "sysemu/replay.h"
+#include "sysemu/block-backend.h"
+#include "replay-internal.h"
+
+static void replay_aio_cancel(BlockAIOCB *acb);
+
+typedef struct ReplayAIOCB {
+    BlockAIOCB common;
+    BlockRequest req;
+    BlockBackend *blk;
+    ReplayAsyncEventKind kind;
+} ReplayAIOCB;
+
+static const AIOCBInfo replay_aiocb_info = {
+    .aiocb_size   = sizeof(ReplayAIOCB),
+    .cancel_async = replay_aio_cancel
+};
+
+static ReplayAIOCB *replay_aio_create(ReplayAsyncEventKind kind,
+                                      BlockBackend *blk,
+                                      BlockCompletionFunc *cb, void *opaque)
+{
+    ReplayAIOCB *acb = g_malloc0(sizeof(*acb));
+    acb->kind = kind;
+    /* Make correct AIOCB to return to the caller */
+    acb->common.aiocb_info = &replay_aiocb_info;
+    acb->common.bs = blk_bs(blk);
+    acb->common.cb = cb;
+    acb->common.opaque = opaque;
+    acb->common.refcnt = 1;
+    /* Data to restore and pass request to the bdrv */
+    acb->blk = blk;
+    acb->req.error = -EINPROGRESS;
+    return acb;
+}
+
+BlockAIOCB *replay_aio_readv(BlockBackend *blk, int64_t sector_num,
+                             QEMUIOVector *iov, int nb_sectors,
+                             BlockCompletionFunc *cb, void *opaque)
+{
+    if (replay_mode == REPLAY_MODE_NONE) {
+        return blk_aio_readv_impl(blk, sector_num, iov, nb_sectors, cb, opaque);
+    } else {
+        ReplayAIOCB *acb = replay_aio_create(REPLAY_ASYNC_EVENT_BLOCK_READV,
+                                             blk, cb, opaque);
+        acb->req.sector = sector_num;
+        acb->req.nb_sectors = nb_sectors;
+        acb->req.qiov = iov;
+        replay_add_event(REPLAY_ASYNC_EVENT_BLOCK_READV, acb, NULL, 0);
+
+        return &acb->common;
+    }
+}
+
+void replay_event_block_readv_run(void *opaque)
+{
+    ReplayAIOCB *acb = opaque;
+    blk_aio_readv_impl(acb->blk, acb->req.sector, acb->req.qiov,
+                       acb->req.nb_sectors, acb->common.cb,
+                       acb->common.opaque);
+    blk_drain_all();
+}
+
+BlockAIOCB *replay_aio_writev(BlockBackend *blk, int64_t sector_num,
+                              QEMUIOVector *iov, int nb_sectors,
+                              BlockCompletionFunc *cb, void *opaque)
+{
+    if (replay_mode == REPLAY_MODE_NONE) {
+        return blk_aio_writev_impl(blk, sector_num, iov, nb_sectors, cb, opaque);
+    } else {
+        ReplayAIOCB *acb = replay_aio_create(REPLAY_ASYNC_EVENT_BLOCK_WRITEV,
+                                             blk, cb, opaque);
+        acb->req.sector = sector_num;
+        acb->req.nb_sectors = nb_sectors;
+        acb->req.qiov = iov;
+        replay_add_event(REPLAY_ASYNC_EVENT_BLOCK_WRITEV, acb, NULL, 0);
+
+        return &acb->common;
+    }
+}
+
+void replay_event_block_writev_run(void *opaque)
+{
+    ReplayAIOCB *acb = opaque;
+    blk_aio_writev_impl(acb->blk, acb->req.sector, acb->req.qiov,
+                        acb->req.nb_sectors, acb->common.cb,
+                        acb->common.opaque);
+    blk_drain_all();
+}
+
+BlockAIOCB *replay_aio_write_zeroes(BlockBackend *blk, int64_t sector_num,
+                                    int nb_sectors, BdrvRequestFlags flags,
+                                    BlockCompletionFunc *cb, void *opaque)
+{
+    if (replay_mode == REPLAY_MODE_NONE) {
+        return blk_aio_write_zeroes_impl(blk, sector_num, nb_sectors, flags, cb, opaque);
+    } else {
+        ReplayAIOCB *acb = replay_aio_create(REPLAY_ASYNC_EVENT_BLOCK_WRITE_ZEROES,
+                                             blk, cb, opaque);
+        acb->req.sector = sector_num;
+        acb->req.nb_sectors = nb_sectors;
+        acb->req.flags = flags;
+        replay_add_event(REPLAY_ASYNC_EVENT_BLOCK_WRITE_ZEROES, acb, NULL, 0);
+
+        return &acb->common;
+    }
+}
+
+void replay_event_block_write_zeroes_run(void *opaque)
+{
+    ReplayAIOCB *acb = opaque;
+    blk_aio_write_zeroes_impl(acb->blk, acb->req.sector, acb->req.nb_sectors,
+                              acb->req.flags, acb->common.cb,
+                              acb->common.opaque);
+    blk_drain_all();
+}
+
+BlockAIOCB *replay_aio_flush(BlockBackend *blk,
+                             BlockCompletionFunc *cb, void *opaque)
+{
+    if (replay_mode == REPLAY_MODE_NONE) {
+        return blk_aio_flush_impl(blk, cb, opaque);
+    } else {
+        ReplayAIOCB *acb = replay_aio_create(REPLAY_ASYNC_EVENT_BLOCK_FLUSH,
+                                             blk, cb, opaque);
+        replay_add_event(REPLAY_ASYNC_EVENT_BLOCK_FLUSH, acb, NULL, 0);
+
+        return &acb->common;
+    }
+}
+
+void replay_event_block_flush_run(void *opaque)
+{
+    ReplayAIOCB *acb = opaque;
+    blk_aio_flush_impl(acb->blk, acb->common.cb, acb->common.opaque);
+    blk_drain_all();
+}
+
+static void replay_aio_cancel(BlockAIOCB *acb)
+{
+    ReplayAIOCB *racb = container_of(acb, ReplayAIOCB, common);;
+    replay_remove_event(racb->kind, acb, NULL, 0);
+}
+
+BlockAIOCB *replay_aio_discard(BlockBackend *blk, int64_t sector_num,
+                               int nb_sectors, BlockCompletionFunc *cb,
+                               void *opaque)
+{
+    if (replay_mode == REPLAY_MODE_NONE) {
+        return blk_aio_discard_impl(blk, sector_num, nb_sectors, cb, opaque);
+    } else {
+        ReplayAIOCB *acb = replay_aio_create(REPLAY_ASYNC_EVENT_BLOCK_DISCARD,
+                                             blk, cb, opaque);
+        acb->req.sector = sector_num;
+        acb->req.nb_sectors = nb_sectors;
+        replay_add_event(REPLAY_ASYNC_EVENT_BLOCK_DISCARD, acb, NULL, 0);
+
+        return &acb->common;
+    }
+}
+
+void replay_event_block_discard_run(void *opaque)
+{
+    ReplayAIOCB *acb = opaque;
+    blk_aio_discard_impl(acb->blk, acb->req.sector, acb->req.nb_sectors,
+                         acb->common.cb, acb->common.opaque);
+    blk_drain_all();
+}
+
+BlockAIOCB *replay_aio_ioctl(BlockBackend *blk, unsigned long int req, void *buf,
+                             BlockCompletionFunc *cb, void *opaque)
+{
+    if (replay_mode == REPLAY_MODE_NONE) {
+        return blk_aio_ioctl_impl(blk, req, buf, cb, opaque);
+    } else {
+        ReplayAIOCB *acb = replay_aio_create(REPLAY_ASYNC_EVENT_BLOCK_IOCTL,
+                                             blk, cb, opaque);
+        acb->req.req = req;
+        acb->req.buf = buf;
+        replay_add_event(REPLAY_ASYNC_EVENT_BLOCK_IOCTL, acb, NULL, 0);
+
+        return &acb->common;
+    }
+}
+
+void replay_event_block_ioctl_run(void *opaque)
+{
+    ReplayAIOCB *acb = opaque;
+    blk_aio_ioctl_impl(acb->blk, acb->req.req, acb->req.buf,
+                       acb->common.cb, acb->common.opaque);
+    blk_drain_all();
+}
diff --git a/replay/replay-events.c b/replay/replay-events.c
index 7fc7ed0..1551257 100644
--- a/replay/replay-events.c
+++ b/replay/replay-events.c
@@ -50,6 +50,24 @@  static void replay_run_event(Event *event)
     case REPLAY_ASYNC_EVENT_CHAR:
         replay_event_char_run(event->opaque);
         break;
+    case REPLAY_ASYNC_EVENT_BLOCK_READV:
+        replay_event_block_readv_run(event->opaque);
+        break;
+    case REPLAY_ASYNC_EVENT_BLOCK_WRITEV:
+        replay_event_block_writev_run(event->opaque);
+        break;
+    case REPLAY_ASYNC_EVENT_BLOCK_WRITE_ZEROES:
+        replay_event_block_write_zeroes_run(event->opaque);
+        break;
+    case REPLAY_ASYNC_EVENT_BLOCK_FLUSH:
+        replay_event_block_flush_run(event->opaque);
+        break;
+    case REPLAY_ASYNC_EVENT_BLOCK_DISCARD:
+        replay_event_block_discard_run(event->opaque);
+        break;
+    case REPLAY_ASYNC_EVENT_BLOCK_IOCTL:
+        replay_event_block_ioctl_run(event->opaque);
+        break;
     default:
         error_report("Replay: invalid async event ID (%d) in the queue",
                     event->event_kind);
@@ -132,6 +150,24 @@  void replay_add_event(ReplayAsyncEventKind event_kind,
     replay_mutex_unlock();
 }
 
+void replay_remove_event(ReplayAsyncEventKind event_kind,
+                         void *opaque,
+                         void *opaque2, uint64_t id)
+{
+    Event *event;
+    replay_mutex_lock();
+    QTAILQ_FOREACH(event, &events_list, events) {
+        if (event->event_kind == event_kind
+            && event->opaque == opaque
+            && event->opaque2 == opaque2
+            && event->id == id) {
+            QTAILQ_REMOVE(&events_list, event, events);
+            break;
+        }
+    }
+    replay_mutex_unlock();
+}
+
 void replay_bh_schedule_event(QEMUBH *bh)
 {
     if (replay_mode != REPLAY_MODE_NONE) {
@@ -173,6 +209,13 @@  static void replay_save_event(Event *event, int checkpoint)
         case REPLAY_ASYNC_EVENT_CHAR:
             replay_event_char_save(event->opaque);
             break;
+        case REPLAY_ASYNC_EVENT_BLOCK_READV:
+        case REPLAY_ASYNC_EVENT_BLOCK_WRITEV:
+        case REPLAY_ASYNC_EVENT_BLOCK_WRITE_ZEROES:
+        case REPLAY_ASYNC_EVENT_BLOCK_FLUSH:
+        case REPLAY_ASYNC_EVENT_BLOCK_DISCARD:
+        case REPLAY_ASYNC_EVENT_BLOCK_IOCTL:
+            break;
         default:
             error_report("Unknown ID %d of replay event", read_event_kind);
             exit(1);
@@ -231,6 +274,14 @@  static Event *replay_read_event(int checkpoint)
         event->event_kind = read_event_kind;
         event->opaque = replay_event_char_read();
         return event;
+    case REPLAY_ASYNC_EVENT_BLOCK_READV:
+    case REPLAY_ASYNC_EVENT_BLOCK_WRITEV:
+    case REPLAY_ASYNC_EVENT_BLOCK_WRITE_ZEROES:
+    case REPLAY_ASYNC_EVENT_BLOCK_FLUSH:
+    case REPLAY_ASYNC_EVENT_BLOCK_DISCARD:
+    case REPLAY_ASYNC_EVENT_BLOCK_IOCTL:
+        /* search in the queue */
+        break;
     default:
         error_report("Unknown ID %d of replay event", read_event_kind);
         exit(1);
diff --git a/replay/replay-internal.h b/replay/replay-internal.h
index 34eee5b..8f705d1 100644
--- a/replay/replay-internal.h
+++ b/replay/replay-internal.h
@@ -47,6 +47,12 @@  enum ReplayAsyncEventKind {
     REPLAY_ASYNC_EVENT_INPUT,
     REPLAY_ASYNC_EVENT_INPUT_SYNC,
     REPLAY_ASYNC_EVENT_CHAR,
+    REPLAY_ASYNC_EVENT_BLOCK_READV,
+    REPLAY_ASYNC_EVENT_BLOCK_WRITEV,
+    REPLAY_ASYNC_EVENT_BLOCK_WRITE_ZEROES,
+    REPLAY_ASYNC_EVENT_BLOCK_FLUSH,
+    REPLAY_ASYNC_EVENT_BLOCK_DISCARD,
+    REPLAY_ASYNC_EVENT_BLOCK_IOCTL,
     REPLAY_ASYNC_COUNT
 };
 
@@ -131,6 +137,9 @@  void replay_read_events(int checkpoint);
 /*! Adds specified async event to the queue */
 void replay_add_event(ReplayAsyncEventKind event_kind, void *opaque,
                       void *opaque2, uint64_t id);
+/*! Removes specified async event from the queue */
+void replay_remove_event(ReplayAsyncEventKind event_kind,
+                         void *opaque, void *opaque2, uint64_t id);
 
 /* Input events */
 
@@ -152,4 +161,19 @@  void replay_event_char_save(void *opaque);
 /*! Reads char event from the file. */
 void *replay_event_char_read(void);
 
+/* Block devices */
+
+/*! Called to run block device readv event. */
+void replay_event_block_readv_run(void *opaque);
+/*! Called to run block device writev event. */
+void replay_event_block_writev_run(void *opaque);
+/*! Called to run block device write zeroes event. */
+void replay_event_block_write_zeroes_run(void *opaque);
+/*! Called to run block device flush event. */
+void replay_event_block_flush_run(void *opaque);
+/*! Called to run block device discard event. */
+void replay_event_block_discard_run(void *opaque);
+/*! Called to run block device ioctl event. */
+void replay_event_block_ioctl_run(void *opaque);
+
 #endif
diff --git a/stubs/replay.c b/stubs/replay.c
index 8f98790..54a5f61 100644
--- a/stubs/replay.c
+++ b/stubs/replay.c
@@ -1,6 +1,7 @@ 
 #include "sysemu/replay.h"
 #include <stdlib.h>
 #include "sysemu/sysemu.h"
+#include "sysemu/block-backend.h"
 
 ReplayMode replay_mode;
 
@@ -29,3 +30,48 @@  bool replay_events_enabled(void)
 void replay_finish(void)
 {
 }
+
+BlockAIOCB *replay_aio_readv(BlockBackend *blk, int64_t sector_num,
+                             QEMUIOVector *iov, int nb_sectors,
+                             BlockCompletionFunc *cb, void *opaque)
+{
+    return blk_aio_readv_impl(blk, sector_num, iov, nb_sectors, cb, opaque);
+}
+
+BlockAIOCB *replay_aio_writev(BlockBackend *blk, int64_t sector_num,
+                              QEMUIOVector *iov, int nb_sectors,
+                              BlockCompletionFunc *cb, void *opaque)
+{
+    return blk_aio_writev_impl(blk, sector_num, iov, nb_sectors, cb, opaque);
+}
+
+BlockAIOCB *replay_aio_flush(BlockBackend *blk,
+                             BlockCompletionFunc *cb, void *opaque)
+{
+    return blk_aio_flush_impl(blk, cb, opaque);
+}
+
+BlockAIOCB *replay_aio_discard(BlockBackend *blk,
+                               int64_t sector_num, int nb_sectors,
+                               BlockCompletionFunc *cb, void *opaque)
+{
+    return blk_aio_discard_impl(blk, sector_num, nb_sectors, cb, opaque);
+}
+
+BlockAIOCB *replay_aio_ioctl(BlockBackend *blk, unsigned long int req, void *buf,
+                             BlockCompletionFunc *cb, void *opaque)
+{
+    return blk_aio_ioctl_impl(blk, req, buf, cb, opaque);
+}
+
+BlockAIOCB *replay_aio_write_zeroes(BlockBackend *blk, int64_t sector_num,
+                                    int nb_sectors, BdrvRequestFlags flags,
+                                    BlockCompletionFunc *cb, void *opaque)
+{
+    return blk_aio_write_zeroes_impl(blk, sector_num, nb_sectors, flags, cb, opaque);
+}
+
+void replay_bh_schedule_event(QEMUBH *bh)
+{
+    qemu_bh_schedule(bh);
+}