Message ID | 20160209055524.8208.16023.stgit@PASHA-ISP (mailing list archive)
---|---
State | New, archived
Am 09.02.2016 um 06:55 hat Pavel Dovgalyuk geschrieben: > This patch introduces a set of functions that implement recording > and replaying of block devices' operations. These functions form a thin > layer between blk_aio_ functions and replay subsystem. > All asynchronous block requests are added to the queue to be processed > at deterministically invoked record/replay checkpoints. > Queue is flushed at checkpoints and information about processed requests > is recorded to the log. In replay phase the queue is matched with > events read from the log. Therefore block devices requests are processed > deterministically. > > Signed-off-by: Pavel Dovgalyuk <pavel.dovgaluk@ispras.ru> This series doesn't seem to apply to current master, so it's somewhat hard to look at the end result. I can see just from patches, though, that this will need some more discussion. Just picking one example of how you convert blk_* functions: > -BlockAIOCB *blk_aio_write_zeroes(BlockBackend *blk, int64_t sector_num, > - int nb_sectors, BdrvRequestFlags flags, > - BlockCompletionFunc *cb, void *opaque) > +BlockAIOCB *blk_aio_write_zeroes_impl(BlockBackend *blk, int64_t sector_num, > + int nb_sectors, BdrvRequestFlags flags, > + BlockCompletionFunc *cb, void *opaque) > { > int ret = blk_check_request(blk, sector_num, nb_sectors); > if (ret < 0) { > @@ -673,6 +674,13 @@ BlockAIOCB *blk_aio_write_zeroes(BlockBackend *blk, int64_t sector_num, > cb, opaque); > } > > +BlockAIOCB *blk_aio_write_zeroes(BlockBackend *blk, int64_t sector_num, > + int nb_sectors, BdrvRequestFlags flags, > + BlockCompletionFunc *cb, void *opaque) > +{ > + return replay_aio_write_zeroes(blk, sector_num, nb_sectors, flags, cb, opaque); > +} > + > +BlockAIOCB *replay_aio_write_zeroes(BlockBackend *blk, int64_t sector_num, > + int nb_sectors, BdrvRequestFlags flags, > + BlockCompletionFunc *cb, void *opaque) > +{ > + if (replay_mode == REPLAY_MODE_NONE) { > + return blk_aio_write_zeroes_impl(blk, sector_num, nb_sectors, flags, cb, opaque); > + } else { > + ReplayAIOCB *acb = replay_aio_create(REPLAY_ASYNC_EVENT_BLOCK_WRITE_ZEROES, > + blk, cb, opaque); > + acb->req.sector = sector_num; > + acb->req.nb_sectors = nb_sectors; > + acb->req.flags = flags; > + replay_add_event(REPLAY_ASYNC_EVENT_BLOCK_WRITE_ZEROES, acb, NULL, 0); > + > + return &acb->common; > + } > +} I think it's obvious that adding two functions to the call chain which do nothing in the common case is a bit ugly. If we did this for every feature that could possibly be enabled, we'd end up with two-kilometer stack traces. So definitely don't call into replay.c, which just calls back in 99.9% of the cases, but if anything, do the check in block-backends.c. But even this doesn't feel completely right, because block drivers are already layered and there is no need to hardcode something optional (and rarely used) in the hot code path that could just be another layer. I assume that you know beforehand if you want to replay something, so requiring you to configure your block devices with a replay driver on top of the stack seems reasonable enough. Kevin
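As a rough illustration of the "replay driver on top of the stack" idea (a sketch only, modelled on how drivers such as raw_bsd.c register themselves in the 2016-era block layer; the blkreplay_* callbacks are hypothetical names and no such driver exists at this point in the thread):

    /* A pass-through filter driver: guest requests flow through its
     * .bdrv_co_readv/.bdrv_co_writev callbacks, which is where record/
     * replay logic could hook in, instead of touching blk_aio_*(). */
    static BlockDriver bdrv_blkreplay = {
        .format_name    = "blkreplay",
        .instance_size  = 0,
        .bdrv_getlength = blkreplay_getlength,  /* forward to bs->file */
        .bdrv_co_readv  = blkreplay_co_readv,   /* record/replay hook */
        .bdrv_co_writev = blkreplay_co_writev,  /* record/replay hook */
    };

    static void bdrv_blkreplay_init(void)
    {
        bdrv_register(&bdrv_blkreplay);
    }
    block_init(bdrv_blkreplay_init);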
> From: Kevin Wolf [mailto:kwolf@redhat.com] > Am 09.02.2016 um 06:55 hat Pavel Dovgalyuk geschrieben: > > This patch introduces a set of functions that implement recording > > and replaying of block devices' operations. These functions form a thin > > layer between blk_aio_ functions and replay subsystem. > > All asynchronous block requests are added to the queue to be processed > > at deterministically invoked record/replay checkpoints. > > Queue is flushed at checkpoints and information about processed requests > > is recorded to the log. In replay phase the queue is matched with > > events read from the log. Therefore block devices requests are processed > > deterministically. > > > > Signed-off-by: Pavel Dovgalyuk <pavel.dovgaluk@ispras.ru> > > This series doesn't seem to apply to current master, so it's somewhat > hard to look at the end result. I can see just from patches, though, > that this will need some more discussion. Thank you for the response. I forgot to rebase these patches, but there are minor problems with applying them. > > Just picking one example of how you convert blk_* functions: > > > -BlockAIOCB *blk_aio_write_zeroes(BlockBackend *blk, int64_t sector_num, > > - int nb_sectors, BdrvRequestFlags flags, > > - BlockCompletionFunc *cb, void *opaque) > > +BlockAIOCB *blk_aio_write_zeroes_impl(BlockBackend *blk, int64_t sector_num, > > + int nb_sectors, BdrvRequestFlags flags, > > + BlockCompletionFunc *cb, void *opaque) > > { > > int ret = blk_check_request(blk, sector_num, nb_sectors); > > if (ret < 0) { > > @@ -673,6 +674,13 @@ BlockAIOCB *blk_aio_write_zeroes(BlockBackend *blk, int64_t sector_num, > > cb, opaque); > > } > > > > +BlockAIOCB *blk_aio_write_zeroes(BlockBackend *blk, int64_t sector_num, > > + int nb_sectors, BdrvRequestFlags flags, > > + BlockCompletionFunc *cb, void *opaque) > > +{ > > + return replay_aio_write_zeroes(blk, sector_num, nb_sectors, flags, cb, opaque); > > +} > > + > > > +BlockAIOCB *replay_aio_write_zeroes(BlockBackend *blk, int64_t sector_num, > > + int nb_sectors, BdrvRequestFlags flags, > > + BlockCompletionFunc *cb, void *opaque) > > +{ > > + if (replay_mode == REPLAY_MODE_NONE) { > > + return blk_aio_write_zeroes_impl(blk, sector_num, nb_sectors, flags, cb, opaque); > > + } else { > > + ReplayAIOCB *acb = replay_aio_create(REPLAY_ASYNC_EVENT_BLOCK_WRITE_ZEROES, > > + blk, cb, opaque); > > + acb->req.sector = sector_num; > > + acb->req.nb_sectors = nb_sectors; > > + acb->req.flags = flags; > > + replay_add_event(REPLAY_ASYNC_EVENT_BLOCK_WRITE_ZEROES, acb, NULL, 0); > > + > > + return &acb->common; > > + } > > +} > > I think it's obvious that adding two functions to the call chain which > do nothing in the common case is a bit ugly. If we did this for every > feature that could possibly be enabled, we'd end up with two-kilometer > stack traces. > > So definitely don't call into replay.c, which just calls back in 99.9% > of the cases, but if anything, do the check in block-backends.c. This way seems to be better. > But even this doesn't feel completely right, because block drivers are > already layered and there is no need to hardcode something optional (and > rarely used) in the hot code path that could just be another layer. > > I assume that you know beforehand if you want to replay something, so > requiring you to configure your block devices with a replay driver on > top of the stack seems reasonable enough. I cannot use block drivers for this. 
When driver functions are used, QEMU has already entered coroutines (and probably started bottom halves). Coroutines make execution non-deterministic. That's why we have to intercept the blk_aio_ functions, which are called deterministically. Pavel Dovgalyuk
Am 09.02.2016 um 12:52 hat Pavel Dovgalyuk geschrieben: > > From: Kevin Wolf [mailto:kwolf@redhat.com] > > But even this doesn't feel completely right, because block drivers are > > already layered and there is no need to hardcode something optional (and > > rarely used) in the hot code path that could just be another layer. > > > > I assume that you know beforehand if you want to replay something, so > > requiring you to configure your block devices with a replay driver on > > top of the stack seems reasonable enough. > > I cannot use block drivers for this. When driver functions are used, QEMU > is already used coroutines (and probably started bottom halves). > Coroutines make execution non-deterministic. > That's why we have to intercept blk_aio_ functions, that are called > deterministically. What does "deterministic" mean in this context, i.e. what are your exact requirements? I don't think that coroutines introduce anything non-deterministic per se. Depending on what you mean by it, the block layer code paths in block.c may contain problematic code. The block layer uses bottom halves in some cases for request completion, but not until calling into the first driver (why would they be a problem?). What could happen is that a request is serialised and therefore delayed in some special configurations, which sounds a lot like what you wanted to avoid. Is there a writeup somewhere of the context in which this code is called and the rules that need to be considered for replay to work? Kevin
> -----Original Message----- > Am 09.02.2016 um 12:52 hat Pavel Dovgalyuk geschrieben: > > > From: Kevin Wolf [mailto:kwolf@redhat.com] > > > But even this doesn't feel completely right, because block drivers are > > > already layered and there is no need to hardcode something optional (and > > > rarely used) in the hot code path that could just be another layer. > > > > > > I assume that you know beforehand if you want to replay something, so > > > requiring you to configure your block devices with a replay driver on > > > top of the stack seems reasonable enough. > > > > I cannot use block drivers for this. When driver functions are used, QEMU > > is already used coroutines (and probably started bottom halves). > > Coroutines make execution non-deterministic. > > That's why we have to intercept blk_aio_ functions, that are called > > deterministically. > > What does "deterministic" mean in this context, i.e. what are your exact > requirements? "Deterministic" means that the replayed execution should run exactly the same guest instructions in the same sequence as in the recording session. > I don't think that coroutines introduce anything non-deterministic per > se. Depending on what you mean by it, the block layer code paths in > block.c may contain problematic code. They are non-deterministic if we need instruction-level accuracy. Thread switching (and therefore callbacks and BH execution) is non-deterministic. In two different executions these callbacks may happen at different moments of time (counting in number of executed instructions). All operations with virtual devices (including memory, interrupt controller, and disk drive controller) should happen at deterministic moments of time to be replayable. > The block layer uses bottom halves in some cases for request completion, > but not until calling into the first driver (why would they be a > problem?). What could happen is that a request is serialised and > therefore delayed in some special configurations, which sounds a lot > like what you wanted to avoid. Drivers cannot distinguish requests from the guest CPU and requests from savevm/loadvm. The former have to be deterministic, because they affect guest memory, the virtual disk controller, and interrupts. > Is there a writeup somewhere of the context in which this code is called > and the rules that need to be considered for replay to work? There is some information about it in docs/replay.txt and in http://wiki.qemu.org/Features/record-replay Pavel Dovgalyuk
Am 10.02.2016 um 13:05 hat Pavel Dovgalyuk geschrieben: > > -----Original Message----- > > Am 09.02.2016 um 12:52 hat Pavel Dovgalyuk geschrieben: > > > > From: Kevin Wolf [mailto:kwolf@redhat.com] > > > > But even this doesn't feel completely right, because block drivers are > > > > already layered and there is no need to hardcode something optional (and > > > > rarely used) in the hot code path that could just be another layer. > > > > > > > > I assume that you know beforehand if you want to replay something, so > > > > requiring you to configure your block devices with a replay driver on > > > > top of the stack seems reasonable enough. > > > > > > I cannot use block drivers for this. When driver functions are used, QEMU > > > is already used coroutines (and probably started bottom halves). > > > Coroutines make execution non-deterministic. > > > That's why we have to intercept blk_aio_ functions, that are called > > > deterministically. > > > > What does "deterministic" mean in this context, i.e. what are your exact > > requirements? > > "Deterministic" means that the replayed execution should run exactly > the same guest instructions in the same sequence, as in recording session. Okay. I think with this we can do better than what you have now. > > I don't think that coroutines introduce anything non-deterministic per > > se. Depending on what you mean by it, the block layer code paths in > > block.c may contain problematic code. > > They are non-deterministic if we need instruction-level accuracy. > Thread switching (and therefore callbacks and BH execution) is non-deterministic. Thread switching depends on an external event (the kernel scheduler deciding to switch), so agreed, if a thread switch ever influences what the guest sees, that would be a problem. Generally, however, callbacks and BHs don't involve a thread switch at all (BHs can be invoked from a different thread in theory, but we have very few of those cases and they shouldn't be visible for the guest). The same is true for coroutines, which are semantically equivalent to callbacks. > In two different executions these callbacks may happen at different moments of > time (counting in number of executed instructions). > All operations with virtual devices (including memory, interrupt controller, > and disk drive controller) should happen at deterministic moments of time > to be replayable. Right, so let's talk about what this external non-deterministic event really is. I think the only thing whose timing is unknown in the block layer is the completion of I/O requests. This non-determinism comes from the time the I/O syscalls made by the lowest layer (usually raw-posix) take. This means that we can add logic to remove the non-determinism at the point of our choice between raw-posix and the guest device emulation. A block driver on top is as good as anything else. While recording, this block driver would just pass the request to next lower layer (starting a request is deterministic, so it doesn't need to be logged) and once the request completes it logs it. While replaying, the completion of requests is delayed until we read it in the log; if we read it in the log and the request hasn't completed yet, we do a busy wait for it (while(!completed) aio_poll();). This model would get rid of the bdrv_drain_all() that you call everywhere and therefore allow concurrent requests, giving a result that is much closer to the "normal" behaviour without replay. 
> > The block layer uses bottom halves in some cases for request completion, > > but not until calling into the first driver (why would they be a > > problem?). What could happen is that a request is serialised and > > therefore delayed in some special configurations, which sounds a lot > > like what you wanted to avoid. > > Drivers cannot distinguish the requests from guest CPU and from savevm/loadvm. > First ones have to be deterministic, because they affect guest memory, > virtual disk controller, and interrupts. Sure they can, these are two different callbacks. But even if they couldn't, making more things than necessary deterministic might be wasteful, but not really harmful. > > Is there a writeup somewhere of the context in which this code is called > > and the rules that need to be considered for replay to work? > > There is some information about it in docs/replay.txt and in > http://wiki.qemu.org/Features/record-replay Thanks, that was useful. Especially the explanation of non-deterministic events in docs/replay.txt seems to be very much in line with what I'm suggesting, as it made clear that external events are what we should be looking for. Kevin
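The completion gating described above boils down to a small loop on the replay side (illustrative only; request_completed is a hypothetical flag set by the request's completion callback, while aio_poll() and blk_get_aio_context() are the existing event-loop primitives):

    /* Replay mode: hold the guest-visible completion back until the log
     * says it happened; if the log entry is reached before the host I/O
     * has finished, spin the event loop until it does. */
    while (!request_completed) {
        aio_poll(blk_get_aio_context(blk), true /* blocking */);
    }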
> From: Kevin Wolf [mailto:kwolf@redhat.com] > Am 10.02.2016 um 13:05 hat Pavel Dovgalyuk geschrieben: > > > Am 09.02.2016 um 12:52 hat Pavel Dovgalyuk geschrieben: > > > > > From: Kevin Wolf [mailto:kwolf@redhat.com] > > > > > But even this doesn't feel completely right, because block drivers are > > > > > already layered and there is no need to hardcode something optional (and > > > > > rarely used) in the hot code path that could just be another layer. > > > > > > > > > > I assume that you know beforehand if you want to replay something, so > > > > > requiring you to configure your block devices with a replay driver on > > > > > top of the stack seems reasonable enough. > > > > > > > > I cannot use block drivers for this. When driver functions are used, QEMU > > > > is already used coroutines (and probably started bottom halves). > > > > Coroutines make execution non-deterministic. > > > > That's why we have to intercept blk_aio_ functions, that are called > > > > deterministically. > > > > > > What does "deterministic" mean in this context, i.e. what are your exact > > > requirements? > > > > "Deterministic" means that the replayed execution should run exactly > > the same guest instructions in the same sequence, as in recording session. > > Okay. I think with this we can do better than what you have now. > > > > I don't think that coroutines introduce anything non-deterministic per > > > se. Depending on what you mean by it, the block layer code paths in > > > block.c may contain problematic code. > > > > They are non-deterministic if we need instruction-level accuracy. > > Thread switching (and therefore callbacks and BH execution) is non-deterministic. > > Thread switching depends on an external event (the kernel scheduler > deciding to switch), so agreed, if a thread switch ever influences what > the guest sees, that would be a problem. > > Generally, however, callbacks and BHs don't involve a thread switch at > all (BHs can be invoked from a different thread in theory, but we have > very few of those cases and they shouldn't be visible for the guest). > The same is true for coroutines, which are semantically equivalent to > callbacks. > > > In two different executions these callbacks may happen at different moments of > > time (counting in number of executed instructions). > > All operations with virtual devices (including memory, interrupt controller, > > and disk drive controller) should happen at deterministic moments of time > > to be replayable. > > Right, so let's talk about what this external non-deterministic event > really is. > > I think the only thing whose timing is unknown in the block layer is the > completion of I/O requests. This non-determinism comes from the time the > I/O syscalls made by the lowest layer (usually raw-posix) take. Right. > This means that we can add logic to remove the non-determinism at the > point of our choice between raw-posix and the guest device emulation. A > block driver on top is as good as anything else. > > While recording, this block driver would just pass the request to next > lower layer (starting a request is deterministic, so it doesn't need to > be logged) and once the request completes it logs it. While replaying, > the completion of requests is delayed until we read it in the log; if we > read it in the log and the request hasn't completed yet, we do a busy > wait for it (while(!completed) aio_poll();). I tried serializing all bottom halves and worker thread callbacks in previous version of the patches. 
That code was much more complicated and error-prone than the current version. We had to classify all bottom halves into recorded and non-recorded ones (because sometimes they are used for QEMU's own purposes, not the guest's). However, I don't understand yet which layer you are proposing as the candidate for record/replay. Which functions should be changed? I would like to investigate this approach, but I don't see it yet. > This model would get rid of the bdrv_drain_all() that you call > everywhere and therefore allow concurrent requests, giving a result that > is much closer to the "normal" behaviour without replay. > > > > The block layer uses bottom halves in some cases for request completion, > > > but not until calling into the first driver (why would they be a > > > problem?). What could happen is that a request is serialised and > > > therefore delayed in some special configurations, which sounds a lot > > > like what you wanted to avoid. > > > > Drivers cannot distinguish the requests from guest CPU and from savevm/loadvm. > > First ones have to be deterministic, because they affect guest memory, > > virtual disk controller, and interrupts. > > Sure they can, these are two different callbacks. But even if they > couldn't, making more things than necessary deterministic might be > wasteful, but not really harmful. Is there any universal way to check this? Pavel Dovgalyuk
Am 10.02.2016 um 13:51 hat Pavel Dovgalyuk geschrieben: > > From: Kevin Wolf [mailto:kwolf@redhat.com] > > Am 10.02.2016 um 13:05 hat Pavel Dovgalyuk geschrieben: > > > > Am 09.02.2016 um 12:52 hat Pavel Dovgalyuk geschrieben: > > > > > > From: Kevin Wolf [mailto:kwolf@redhat.com] > > > > > > But even this doesn't feel completely right, because block drivers are > > > > > > already layered and there is no need to hardcode something optional (and > > > > > > rarely used) in the hot code path that could just be another layer. > > > > > > > > > > > > I assume that you know beforehand if you want to replay something, so > > > > > > requiring you to configure your block devices with a replay driver on > > > > > > top of the stack seems reasonable enough. > > > > > > > > > > I cannot use block drivers for this. When driver functions are used, QEMU > > > > > is already used coroutines (and probably started bottom halves). > > > > > Coroutines make execution non-deterministic. > > > > > That's why we have to intercept blk_aio_ functions, that are called > > > > > deterministically. > > > > > > > > What does "deterministic" mean in this context, i.e. what are your exact > > > > requirements? > > > > > > "Deterministic" means that the replayed execution should run exactly > > > the same guest instructions in the same sequence, as in recording session. > > > > Okay. I think with this we can do better than what you have now. > > > > > > I don't think that coroutines introduce anything non-deterministic per > > > > se. Depending on what you mean by it, the block layer code paths in > > > > block.c may contain problematic code. > > > > > > They are non-deterministic if we need instruction-level accuracy. > > > Thread switching (and therefore callbacks and BH execution) is non-deterministic. > > > > Thread switching depends on an external event (the kernel scheduler > > deciding to switch), so agreed, if a thread switch ever influences what > > the guest sees, that would be a problem. > > > > Generally, however, callbacks and BHs don't involve a thread switch at > > all (BHs can be invoked from a different thread in theory, but we have > > very few of those cases and they shouldn't be visible for the guest). > > The same is true for coroutines, which are semantically equivalent to > > callbacks. > > > > > In two different executions these callbacks may happen at different moments of > > > time (counting in number of executed instructions). > > > All operations with virtual devices (including memory, interrupt controller, > > > and disk drive controller) should happen at deterministic moments of time > > > to be replayable. > > > > Right, so let's talk about what this external non-deterministic event > > really is. > > > > I think the only thing whose timing is unknown in the block layer is the > > completion of I/O requests. This non-determinism comes from the time the > > I/O syscalls made by the lowest layer (usually raw-posix) take. > > Right. > > > This means that we can add logic to remove the non-determinism at the > > point of our choice between raw-posix and the guest device emulation. A > > block driver on top is as good as anything else. > > > > While recording, this block driver would just pass the request to next > > lower layer (starting a request is deterministic, so it doesn't need to > > be logged) and once the request completes it logs it. 
While replaying, > > the completion of requests is delayed until we read it in the log; if we > > read it in the log and the request hasn't completed yet, we do a busy > > wait for it (while(!completed) aio_poll();). > > I tried serializing all bottom halves and worker thread callbacks in > previous version of the patches. That code was much more complicated > and error-prone than the current version. We had to classify all bottom > halves to recorded and non-recorded (because sometimes they are used > for qemu's purposes, not the guest ones). > > However, I don't understand yet which layer do you offer as the candidate > for record/replay? What functions should be changed? > I would like to investigate this way, but I don't got it yet. At the core, I wouldn't change any existing function, but introduce a new block driver. You could copy raw_bsd.c for a start and then tweak it. Leave out functions that you don't want to support, and add the necessary magic to .bdrv_co_readv/writev. Something like this (can probably be generalised for more than just reads as the part after the bdrv_co_reads() call should be the same for reads, writes and any other request types): int blkreplay_co_readv() { BlockReplayState *s = bs->opaque; int reqid = s->reqid++; bdrv_co_readv(bs->file, ...); if (mode == record) { log(reqid, time); } else { assert(mode == replay); bool *done = req_replayed_list_get(reqid) if (done) { *done = true; } else { req_completed_list_insert(reqid, qemu_coroutine_self()); qemu_coroutine_yield(); } } } /* called by replay.c */ int blkreplay_run_event() { if (mode == replay) { co = req_completed_list_get(e.reqid); if (co) { qemu_coroutine_enter(co); } else { bool done = false; req_replayed_list_insert(reqid, &done); /* wait synchronously for completion */ while (!done) { aio_poll(); } } } } Where we could consider changing existing code is that it might be desirable to automatically put an instance of this block driver on top of every block device when record/replay is used. If we don't do that, you need to explicitly specify -drive driver=blkreplay,... > > This model would get rid of the bdrv_drain_all() that you call > > everywhere and therefore allow concurrent requests, giving a result that > > is much closer to the "normal" behaviour without replay. > > > > > > The block layer uses bottom halves in some cases for request completion, > > > > but not until calling into the first driver (why would they be a > > > > problem?). What could happen is that a request is serialised and > > > > therefore delayed in some special configurations, which sounds a lot > > > > like what you wanted to avoid. > > > > > > Drivers cannot distinguish the requests from guest CPU and from savevm/loadvm. > > > First ones have to be deterministic, because they affect guest memory, > > > virtual disk controller, and interrupts. > > > > Sure they can, these are two different callbacks. But even if they > > couldn't, making more things than necessary deterministic might be > > wasteful, but not really harmful. > > Is there any universal way to check this? No, the block layer doesn't know its caller. It's only possible in this specific case because the guest never calls .bdrv_save_vmstate(). But as I said, logging more requests than necessary doesn't really hurt expect making the log a bit larger. Kevin
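Kevin's sketch from this message, laid out as code (the req_replayed_list_* / req_completed_list_* helpers and the mode/log names are placeholders from the sketch, not existing QEMU APIs):

    int blkreplay_co_readv()
    {
        BlockReplayState *s = bs->opaque;
        int reqid = s->reqid++;

        bdrv_co_readv(bs->file, ...);

        if (mode == record) {
            log(reqid, time);
        } else {
            assert(mode == replay);
            bool *done = req_replayed_list_get(reqid);
            if (done) {
                *done = true;
            } else {
                req_completed_list_insert(reqid, qemu_coroutine_self());
                qemu_coroutine_yield();
            }
        }
    }

    /* called by replay.c */
    int blkreplay_run_event()
    {
        if (mode == replay) {
            co = req_completed_list_get(e.reqid);
            if (co) {
                qemu_coroutine_enter(co);
            } else {
                bool done = false;
                req_replayed_list_insert(reqid, &done);
                /* wait synchronously for completion */
                while (!done) {
                    aio_poll();
                }
            }
        }
    }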
> From: Kevin Wolf [mailto:kwolf@redhat.com] > Am 10.02.2016 um 13:51 hat Pavel Dovgalyuk geschrieben: > > > From: Kevin Wolf [mailto:kwolf@redhat.com] > > > Am 10.02.2016 um 13:05 hat Pavel Dovgalyuk geschrieben: > > > > > Am 09.02.2016 um 12:52 hat Pavel Dovgalyuk geschrieben: > > > > > > > From: Kevin Wolf [mailto:kwolf@redhat.com] > > > > > > > But even this doesn't feel completely right, because block drivers are > > > > > > > already layered and there is no need to hardcode something optional (and > > > > > > > rarely used) in the hot code path that could just be another layer. > > > > > > > > > > > > > > I assume that you know beforehand if you want to replay something, so > > > > > > > requiring you to configure your block devices with a replay driver on > > > > > > > top of the stack seems reasonable enough. > > > > > > > > > > > > I cannot use block drivers for this. When driver functions are used, QEMU > > > > > > is already used coroutines (and probably started bottom halves). > > > > > > Coroutines make execution non-deterministic. > > > > > > That's why we have to intercept blk_aio_ functions, that are called > > > > > > deterministically. > > > > > > > > > > What does "deterministic" mean in this context, i.e. what are your exact > > > > > requirements? > > > > > > > > "Deterministic" means that the replayed execution should run exactly > > > > the same guest instructions in the same sequence, as in recording session. > > > > > > Okay. I think with this we can do better than what you have now. > > > > > > > > I don't think that coroutines introduce anything non-deterministic per > > > > > se. Depending on what you mean by it, the block layer code paths in > > > > > block.c may contain problematic code. > > > > > > > > They are non-deterministic if we need instruction-level accuracy. > > > > Thread switching (and therefore callbacks and BH execution) is non-deterministic. > > > > > > Thread switching depends on an external event (the kernel scheduler > > > deciding to switch), so agreed, if a thread switch ever influences what > > > the guest sees, that would be a problem. > > > > > > Generally, however, callbacks and BHs don't involve a thread switch at > > > all (BHs can be invoked from a different thread in theory, but we have > > > very few of those cases and they shouldn't be visible for the guest). > > > The same is true for coroutines, which are semantically equivalent to > > > callbacks. > > > > > > > In two different executions these callbacks may happen at different moments of > > > > time (counting in number of executed instructions). > > > > All operations with virtual devices (including memory, interrupt controller, > > > > and disk drive controller) should happen at deterministic moments of time > > > > to be replayable. > > > > > > Right, so let's talk about what this external non-deterministic event > > > really is. > > > > > > I think the only thing whose timing is unknown in the block layer is the > > > completion of I/O requests. This non-determinism comes from the time the > > > I/O syscalls made by the lowest layer (usually raw-posix) take. > > > > Right. > > > > > This means that we can add logic to remove the non-determinism at the > > > point of our choice between raw-posix and the guest device emulation. A > > > block driver on top is as good as anything else. 
> > > > > > While recording, this block driver would just pass the request to next > > > lower layer (starting a request is deterministic, so it doesn't need to > > > be logged) and once the request completes it logs it. While replaying, > > > the completion of requests is delayed until we read it in the log; if we > > > read it in the log and the request hasn't completed yet, we do a busy > > > wait for it (while(!completed) aio_poll();). > > > > I tried serializing all bottom halves and worker thread callbacks in > > previous version of the patches. That code was much more complicated > > and error-prone than the current version. We had to classify all bottom > > halves to recorded and non-recorded (because sometimes they are used > > for qemu's purposes, not the guest ones). > > > > However, I don't understand yet which layer do you offer as the candidate > > for record/replay? What functions should be changed? > > I would like to investigate this way, but I don't got it yet. > > At the core, I wouldn't change any existing function, but introduce a > new block driver. You could copy raw_bsd.c for a start and then tweak > it. Leave out functions that you don't want to support, and add the > necessary magic to .bdrv_co_readv/writev. > > Something like this (can probably be generalised for more than just > reads as the part after the bdrv_co_reads() call should be the same for > reads, writes and any other request types): > > int blkreplay_co_readv() > { > BlockReplayState *s = bs->opaque; > int reqid = s->reqid++; > > bdrv_co_readv(bs->file, ...); > > if (mode == record) { > log(reqid, time); > } else { > assert(mode == replay); > bool *done = req_replayed_list_get(reqid) > if (done) { > *done = true; > } else { > req_completed_list_insert(reqid, qemu_coroutine_self()); > qemu_coroutine_yield(); > } > } > } > > /* called by replay.c */ > int blkreplay_run_event() > { > if (mode == replay) { > co = req_completed_list_get(e.reqid); > if (co) { > qemu_coroutine_enter(co); > } else { > bool done = false; > req_replayed_list_insert(reqid, &done); > /* wait synchronously for completion */ > while (!done) { > aio_poll(); > } > } > } > } > > Where we could consider changing existing code is that it might be > desirable to automatically put an instance of this block driver on top > of every block device when record/replay is used. If we don't do that, > you need to explicitly specify -drive driver=blkreplay,... Thank you. > > > > This model would get rid of the bdrv_drain_all() that you call > > > everywhere and therefore allow concurrent requests, giving a result that > > > is much closer to the "normal" behaviour without replay. > > > > > > > > The block layer uses bottom halves in some cases for request completion, > > > > > but not until calling into the first driver (why would they be a > > > > > problem?). What could happen is that a request is serialised and > > > > > therefore delayed in some special configurations, which sounds a lot > > > > > like what you wanted to avoid. > > > > > > > > Drivers cannot distinguish the requests from guest CPU and from savevm/loadvm. > > > > First ones have to be deterministic, because they affect guest memory, > > > > virtual disk controller, and interrupts. > > > > > > Sure they can, these are two different callbacks. But even if they > > > couldn't, making more things than necessary deterministic might be > > > wasteful, but not really harmful. > > > > Is there any universal way to check this? > > No, the block layer doesn't know its caller. 
> It's only possible in this > specific case because the guest never calls .bdrv_save_vmstate(). > > But as I said, logging more requests than necessary doesn't really hurt > expect making the log a bit larger. Let's make this clear first. Logging more requests really hurts, because we may want to do some additional things in record or replay mode. E.g., we may save some snapshots (in record mode) to be used later. Saving operations should not be replayed. But in replay mode we may load back some of these snapshots and replay from that moment. Or the user may issue a command in the monitor that reads some disk information. That's why I want to record only guest operations. Pavel Dovgalyuk
Am 10.02.2016 um 14:33 hat Pavel Dovgalyuk geschrieben: > > From: Kevin Wolf [mailto:kwolf@redhat.com] > > Am 10.02.2016 um 13:51 hat Pavel Dovgalyuk geschrieben: > > > > From: Kevin Wolf [mailto:kwolf@redhat.com] > > > > Am 10.02.2016 um 13:05 hat Pavel Dovgalyuk geschrieben: > > > > > > Am 09.02.2016 um 12:52 hat Pavel Dovgalyuk geschrieben: > > > > > > > > From: Kevin Wolf [mailto:kwolf@redhat.com] > > > > > > > > But even this doesn't feel completely right, because block drivers are > > > > > > > > already layered and there is no need to hardcode something optional (and > > > > > > > > rarely used) in the hot code path that could just be another layer. > > > > > > > > > > > > > > > > I assume that you know beforehand if you want to replay something, so > > > > > > > > requiring you to configure your block devices with a replay driver on > > > > > > > > top of the stack seems reasonable enough. > > > > > > > > > > > > > > I cannot use block drivers for this. When driver functions are used, QEMU > > > > > > > is already used coroutines (and probably started bottom halves). > > > > > > > Coroutines make execution non-deterministic. > > > > > > > That's why we have to intercept blk_aio_ functions, that are called > > > > > > > deterministically. > > > > > > > > > > > > What does "deterministic" mean in this context, i.e. what are your exact > > > > > > requirements? > > > > > > > > > > "Deterministic" means that the replayed execution should run exactly > > > > > the same guest instructions in the same sequence, as in recording session. > > > > > > > > Okay. I think with this we can do better than what you have now. > > > > > > > > > > I don't think that coroutines introduce anything non-deterministic per > > > > > > se. Depending on what you mean by it, the block layer code paths in > > > > > > block.c may contain problematic code. > > > > > > > > > > They are non-deterministic if we need instruction-level accuracy. > > > > > Thread switching (and therefore callbacks and BH execution) is non-deterministic. > > > > > > > > Thread switching depends on an external event (the kernel scheduler > > > > deciding to switch), so agreed, if a thread switch ever influences what > > > > the guest sees, that would be a problem. > > > > > > > > Generally, however, callbacks and BHs don't involve a thread switch at > > > > all (BHs can be invoked from a different thread in theory, but we have > > > > very few of those cases and they shouldn't be visible for the guest). > > > > The same is true for coroutines, which are semantically equivalent to > > > > callbacks. > > > > > > > > > In two different executions these callbacks may happen at different moments of > > > > > time (counting in number of executed instructions). > > > > > All operations with virtual devices (including memory, interrupt controller, > > > > > and disk drive controller) should happen at deterministic moments of time > > > > > to be replayable. > > > > > > > > Right, so let's talk about what this external non-deterministic event > > > > really is. > > > > > > > > I think the only thing whose timing is unknown in the block layer is the > > > > completion of I/O requests. This non-determinism comes from the time the > > > > I/O syscalls made by the lowest layer (usually raw-posix) take. > > > > > > Right. > > > > > > > This means that we can add logic to remove the non-determinism at the > > > > point of our choice between raw-posix and the guest device emulation. A > > > > block driver on top is as good as anything else. 
> > > > > > > > While recording, this block driver would just pass the request to next > > > > lower layer (starting a request is deterministic, so it doesn't need to > > > > be logged) and once the request completes it logs it. While replaying, > > > > the completion of requests is delayed until we read it in the log; if we > > > > read it in the log and the request hasn't completed yet, we do a busy > > > > wait for it (while(!completed) aio_poll();). > > > > > > I tried serializing all bottom halves and worker thread callbacks in > > > previous version of the patches. That code was much more complicated > > > and error-prone than the current version. We had to classify all bottom > > > halves to recorded and non-recorded (because sometimes they are used > > > for qemu's purposes, not the guest ones). > > > > > > However, I don't understand yet which layer do you offer as the candidate > > > for record/replay? What functions should be changed? > > > I would like to investigate this way, but I don't got it yet. > > > > At the core, I wouldn't change any existing function, but introduce a > > new block driver. You could copy raw_bsd.c for a start and then tweak > > it. Leave out functions that you don't want to support, and add the > > necessary magic to .bdrv_co_readv/writev. > > > > Something like this (can probably be generalised for more than just > > reads as the part after the bdrv_co_reads() call should be the same for > > reads, writes and any other request types): > > > > int blkreplay_co_readv() > > { > > BlockReplayState *s = bs->opaque; > > int reqid = s->reqid++; > > > > bdrv_co_readv(bs->file, ...); > > > > if (mode == record) { > > log(reqid, time); > > } else { > > assert(mode == replay); > > bool *done = req_replayed_list_get(reqid) > > if (done) { > > *done = true; > > } else { > > req_completed_list_insert(reqid, qemu_coroutine_self()); > > qemu_coroutine_yield(); > > } > > } > > } > > > > /* called by replay.c */ > > int blkreplay_run_event() > > { > > if (mode == replay) { > > co = req_completed_list_get(e.reqid); > > if (co) { > > qemu_coroutine_enter(co); > > } else { > > bool done = false; > > req_replayed_list_insert(reqid, &done); > > /* wait synchronously for completion */ > > while (!done) { > > aio_poll(); > > } > > } > > } > > } > > > > Where we could consider changing existing code is that it might be > > desirable to automatically put an instance of this block driver on top > > of every block device when record/replay is used. If we don't do that, > > you need to explicitly specify -drive driver=blkreplay,... > > Thank you. > > > > > > > This model would get rid of the bdrv_drain_all() that you call > > > > everywhere and therefore allow concurrent requests, giving a result that > > > > is much closer to the "normal" behaviour without replay. > > > > > > > > > > The block layer uses bottom halves in some cases for request completion, > > > > > > but not until calling into the first driver (why would they be a > > > > > > problem?). What could happen is that a request is serialised and > > > > > > therefore delayed in some special configurations, which sounds a lot > > > > > > like what you wanted to avoid. > > > > > > > > > > Drivers cannot distinguish the requests from guest CPU and from savevm/loadvm. > > > > > First ones have to be deterministic, because they affect guest memory, > > > > > virtual disk controller, and interrupts. > > > > > > > > Sure they can, these are two different callbacks. 
But even if they > > > > couldn't, making more things than necessary deterministic might be > > > > wasteful, but not really harmful. > > > > > > Is there any universal way to check this? > > > > No, the block layer doesn't know its caller. It's only possible in this > > specific case because the guest never calls .bdrv_save_vmstate(). > > > > But as I said, logging more requests than necessary doesn't really hurt > > expect making the log a bit larger. > > Let's make this clear first. > Logging more request really hurts, because we may want to do some additional > things in record or replay modes. > E.g., we may save some snapshots (in record mode) to be used later. > Saving operations should not be replayed. But in replay mode we may load back > some of these snapshots and replay from that moment. > Or user will issue a command in monitor and it will read some disk information. > > That's why I want to record only guest operations. Understood. As I said, for savevm/loadvm we are okay because they don't use the same functions as guest I/O anyway. There are a few more advanced block layer features (like block jobs or the built-in NBD server) that produce requests even though they are not in the guest. The solution for that is to target them at the layer below the blkreplay driver, which is identifiable by its own node-name. This is not possible in all cases yet, but we are actively working on allowing this. I expect that using such features together with record and replay is highly unusual anyway. Kevin
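A hedged example of what such a configuration could look like on the command line (the blkreplay driver and its image option did not exist at this point in the thread, and the exact -drive spelling is illustrative; the point is only that the filter and the image below it are separately addressable):

    # the real image is its own backend, addressable by internal users
    # (block jobs, NBD server, analysis code) without touching the log
    -drive file=disk.qcow2,if=none,id=img-direct
    # the guest device sits on top of the replay filter, so only guest
    # requests are recorded and replayed
    -drive driver=blkreplay,if=none,image=img-direct,id=img-replay
    -device ide-hd,drive=img-replay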
> From: Kevin Wolf [mailto:kwolf@redhat.com] > Am 10.02.2016 um 13:51 hat Pavel Dovgalyuk geschrieben: > > > From: Kevin Wolf [mailto:kwolf@redhat.com] > > > Am 10.02.2016 um 13:05 hat Pavel Dovgalyuk geschrieben: > > > > > Am 09.02.2016 um 12:52 hat Pavel Dovgalyuk geschrieben: > > > > > > > From: Kevin Wolf [mailto:kwolf@redhat.com] > > > > > > > But even this doesn't feel completely right, because block drivers are > > > > > > > already layered and there is no need to hardcode something optional (and > > > > > > > rarely used) in the hot code path that could just be another layer. > > > > > > > > > > > > > > I assume that you know beforehand if you want to replay something, so > > > > > > > requiring you to configure your block devices with a replay driver on > > > > > > > top of the stack seems reasonable enough. > > > > > > > > > > > > I cannot use block drivers for this. When driver functions are used, QEMU > > > > > > is already used coroutines (and probably started bottom halves). > > > > > > Coroutines make execution non-deterministic. > > > > > > That's why we have to intercept blk_aio_ functions, that are called > > > > > > deterministically. > > > > > > > > > > What does "deterministic" mean in this context, i.e. what are your exact > > > > > requirements? > > > > > > > > "Deterministic" means that the replayed execution should run exactly > > > > the same guest instructions in the same sequence, as in recording session. > > > > > > Okay. I think with this we can do better than what you have now. > > > > > > > > I don't think that coroutines introduce anything non-deterministic per > > > > > se. Depending on what you mean by it, the block layer code paths in > > > > > block.c may contain problematic code. > > > > > > > > They are non-deterministic if we need instruction-level accuracy. > > > > Thread switching (and therefore callbacks and BH execution) is non-deterministic. > > > > > > Thread switching depends on an external event (the kernel scheduler > > > deciding to switch), so agreed, if a thread switch ever influences what > > > the guest sees, that would be a problem. > > > > > > Generally, however, callbacks and BHs don't involve a thread switch at > > > all (BHs can be invoked from a different thread in theory, but we have > > > very few of those cases and they shouldn't be visible for the guest). > > > The same is true for coroutines, which are semantically equivalent to > > > callbacks. > > > > > > > In two different executions these callbacks may happen at different moments of > > > > time (counting in number of executed instructions). > > > > All operations with virtual devices (including memory, interrupt controller, > > > > and disk drive controller) should happen at deterministic moments of time > > > > to be replayable. > > > > > > Right, so let's talk about what this external non-deterministic event > > > really is. > > > > > > I think the only thing whose timing is unknown in the block layer is the > > > completion of I/O requests. This non-determinism comes from the time the > > > I/O syscalls made by the lowest layer (usually raw-posix) take. > > > > Right. > > > > > This means that we can add logic to remove the non-determinism at the > > > point of our choice between raw-posix and the guest device emulation. A > > > block driver on top is as good as anything else. 
> > > > > > While recording, this block driver would just pass the request to next > > > lower layer (starting a request is deterministic, so it doesn't need to > > > be logged) and once the request completes it logs it. While replaying, > > > the completion of requests is delayed until we read it in the log; if we > > > read it in the log and the request hasn't completed yet, we do a busy > > > wait for it (while(!completed) aio_poll();). > > > > I tried serializing all bottom halves and worker thread callbacks in > > previous version of the patches. That code was much more complicated > > and error-prone than the current version. We had to classify all bottom > > halves to recorded and non-recorded (because sometimes they are used > > for qemu's purposes, not the guest ones). > > > > However, I don't understand yet which layer do you offer as the candidate > > for record/replay? What functions should be changed? > > I would like to investigate this way, but I don't got it yet. > > At the core, I wouldn't change any existing function, but introduce a > new block driver. You could copy raw_bsd.c for a start and then tweak > it. Leave out functions that you don't want to support, and add the > necessary magic to .bdrv_co_readv/writev. > > Something like this (can probably be generalised for more than just > reads as the part after the bdrv_co_reads() call should be the same for > reads, writes and any other request types): > > int blkreplay_co_readv() > { > BlockReplayState *s = bs->opaque; > int reqid = s->reqid++; > > bdrv_co_readv(bs->file, ...); > > if (mode == record) { > log(reqid, time); > } else { > assert(mode == replay); > bool *done = req_replayed_list_get(reqid) > if (done) { > *done = true; > } else { > req_completed_list_insert(reqid, qemu_coroutine_self()); > qemu_coroutine_yield(); > } > } > } > > /* called by replay.c */ > int blkreplay_run_event() > { > if (mode == replay) { > co = req_completed_list_get(e.reqid); > if (co) { > qemu_coroutine_enter(co); > } else { > bool done = false; > req_replayed_list_insert(reqid, &done); > /* wait synchronously for completion */ > while (!done) { > aio_poll(); > } > } > } > } > > Where we could consider changing existing code is that it might be > desirable to automatically put an instance of this block driver on top > of every block device when record/replay is used. If we don't do that, > you need to explicitly specify -drive driver=blkreplay,... As far as I understand, all synchronous read/write requests are also passed through this coroutine layer. It means that every disk access in the replay phase should match the recording phase. Record/replay is intended to be used for debugging and analysis. When execution is replayed, the guest machine must not notice the analysis overhead. Some analysis methods may include disk image reading. E.g., the QEMU-based analysis framework DECAF uses sleuthkit for disk forensics ( https://github.com/sycurelab/DECAF ). If a similar framework is used together with replay, its forensic disk accesses won't work if we record/replay at the coroutine level. And if we record only the high-level asynchronous disk operations, then there is still a way to perform disk accesses without matching them against the replay log. However, we already tried recording disk operation completion events (in a previous, more complicated version). The recording overhead was the same as with the new blk_aio_ version of the patch. Pavel Dovgalyuk
Am 11.02.2016 um 07:05 hat Pavel Dovgalyuk geschrieben: > > From: Kevin Wolf [mailto:kwolf@redhat.com] > > Am 10.02.2016 um 13:51 hat Pavel Dovgalyuk geschrieben: > > > However, I don't understand yet which layer do you offer as the candidate > > > for record/replay? What functions should be changed? > > > I would like to investigate this way, but I don't got it yet. > > > > At the core, I wouldn't change any existing function, but introduce a > > new block driver. You could copy raw_bsd.c for a start and then tweak > > it. Leave out functions that you don't want to support, and add the > > necessary magic to .bdrv_co_readv/writev. > > > > Something like this (can probably be generalised for more than just > > reads as the part after the bdrv_co_reads() call should be the same for > > reads, writes and any other request types): > > > > int blkreplay_co_readv() > > { > > BlockReplayState *s = bs->opaque; > > int reqid = s->reqid++; > > > > bdrv_co_readv(bs->file, ...); > > > > if (mode == record) { > > log(reqid, time); > > } else { > > assert(mode == replay); > > bool *done = req_replayed_list_get(reqid) > > if (done) { > > *done = true; > > } else { > > req_completed_list_insert(reqid, qemu_coroutine_self()); > > qemu_coroutine_yield(); > > } > > } > > } > > > > /* called by replay.c */ > > int blkreplay_run_event() > > { > > if (mode == replay) { > > co = req_completed_list_get(e.reqid); > > if (co) { > > qemu_coroutine_enter(co); > > } else { > > bool done = false; > > req_replayed_list_insert(reqid, &done); > > /* wait synchronously for completion */ > > while (!done) { > > aio_poll(); > > } > > } > > } > > } > > > > Where we could consider changing existing code is that it might be > > desirable to automatically put an instance of this block driver on top > > of every block device when record/replay is used. If we don't do that, > > you need to explicitly specify -drive driver=blkreplay,... > > As far, as I understand, all synchronous read/write request are also passed > through this coroutines layer. Yes, all read/write requests go through the same function internally, no matter which external interface was used. > It means that every disk access in replay phase should match the recording phase. Right. If I'm not mistaken, this was the fundamental requirement you have, so I wouldn't have suggested this otherwise. > Record/replay is intended to be used for debugging and analysis. > When execution is replayed, guest machine cannot notice analysis overhead. > Some analysis methods may include disk image reading. E.g., qemu-based > analysis framework DECAF uses sleuthkit for disk forensics ( https://github.com/sycurelab/DECAF ). > If similar framework will be used with replay, forensics disk access operations > won't work if we will record/replay the coroutines. Sorry, I'm not sure if I can follow. If such analysis software runs in the guest, it's not a replay any more and I completely fail to see what you're doing. If it's a qemu component independent from the guest, then my method gives you a clean way to bypass the replay driver that wouldn't be possible with yours. If your plan was to record/replay only async requests and then use sync requests to bypass the record/replay, let me clearly state that this is the wrong approach: There are still guest devices which do synchronous I/O and need to be considered in the replay log, and you shouldn't prevent the analysis code from using AIO (in fact, using sync I/O in new code is very much frowned upon). 
I can explain in more detail what the block device structure looks like and how to access an image with and without record/replay, but first let me please know whether I guessed right what your problem is. Or if I missed your point, can you please describe in detail a case that wouldn't work? > And if we'll record only high-level asynchronous disk operations, then > there will be a way to performs disk accesses without matching these > operations with replay log. > > However, we already tried to record disk operations completion events > (in previous more complicated version). Recording overhead was the same > as with new blk_aio_ version of the patch. Well, if your code really intentionally ignores synchronous requests, then the advantage you get from my approach is being correct. Other than that, I'm personally mainly interested in keeping the core block layer easily maintainable and the hot I/O paths clean for cases that don't use all the features qemu has. This means that in the mid term existing features need to move out of the core block layer rather than new ones being added. So doing things in a separate block driver might not be an advantage for your immediate needs, but it's important for clean modularisation and a requirement for things to get merged (unless you can prove that it's impossible to do this way, but I don't think you can - otherwise the whole block layer design would be wrong). Kevin
> From: Kevin Wolf [mailto:kwolf@redhat.com] > Am 11.02.2016 um 07:05 hat Pavel Dovgalyuk geschrieben: > > > From: Kevin Wolf [mailto:kwolf@redhat.com] > > > Am 10.02.2016 um 13:51 hat Pavel Dovgalyuk geschrieben: > > > > However, I don't understand yet which layer do you offer as the candidate > > > > for record/replay? What functions should be changed? > > > > I would like to investigate this way, but I don't got it yet. > > > > > > At the core, I wouldn't change any existing function, but introduce a > > > new block driver. You could copy raw_bsd.c for a start and then tweak > > > it. Leave out functions that you don't want to support, and add the > > > necessary magic to .bdrv_co_readv/writev. > > > > > > Something like this (can probably be generalised for more than just > > > reads as the part after the bdrv_co_reads() call should be the same for > > > reads, writes and any other request types): > > > > > > int blkreplay_co_readv() > > > { > > > BlockReplayState *s = bs->opaque; > > > int reqid = s->reqid++; > > > > > > bdrv_co_readv(bs->file, ...); > > > > > > if (mode == record) { > > > log(reqid, time); > > > } else { > > > assert(mode == replay); > > > bool *done = req_replayed_list_get(reqid) > > > if (done) { > > > *done = true; > > > } else { > > > req_completed_list_insert(reqid, qemu_coroutine_self()); > > > qemu_coroutine_yield(); > > > } > > > } > > > } > > > > > > /* called by replay.c */ > > > int blkreplay_run_event() > > > { > > > if (mode == replay) { > > > co = req_completed_list_get(e.reqid); > > > if (co) { > > > qemu_coroutine_enter(co); > > > } else { > > > bool done = false; > > > req_replayed_list_insert(reqid, &done); > > > /* wait synchronously for completion */ > > > while (!done) { > > > aio_poll(); > > > } > > > } > > > } > > > } > > > > > > Where we could consider changing existing code is that it might be > > > desirable to automatically put an instance of this block driver on top > > > of every block device when record/replay is used. If we don't do that, > > > you need to explicitly specify -drive driver=blkreplay,... > > > > As far, as I understand, all synchronous read/write request are also passed > > through this coroutines layer. > > Yes, all read/write requests go through the same function internally, no > matter which external interface was used. > > > It means that every disk access in replay phase should match the recording phase. > > Right. If I'm not mistaken, this was the fundamental requirement you > have, so I wouldn't have suggested this otherwise. > > > Record/replay is intended to be used for debugging and analysis. > > When execution is replayed, guest machine cannot notice analysis overhead. > > Some analysis methods may include disk image reading. E.g., qemu-based > > analysis framework DECAF uses sleuthkit for disk forensics ( > https://github.com/sycurelab/DECAF ). > > If similar framework will be used with replay, forensics disk access operations > > won't work if we will record/replay the coroutines. > > Sorry, I'm not sure if I can follow. > > If such analysis software runs in the guest, it's not a replay any more > and I completely fail to see what you're doing. > > If it's a qemu component independent from the guest, then my method > gives you a clean way to bypass the replay driver that wouldn't be > possible with yours. The second one. qemu may be extended with some components that perform guest introspection. 
> If your plan was to record/replay only async requests and then use sync > requests to bypass the record/replay, let me clearly state that this is > the wrong approach: There are still guest devices which do synchronous > I/O and need to be considered in the replay log, and you shouldn't > prevent the analysis code from using AIO (in fact, using sync I/O in new > code is very much frowned upon). Why do guest synchronous requests have to be recorded? Aren't they completely deterministic? > I can explain in more detail what the block device structure looks like > and how to access an image with and without record/replay, but first let > me please know whether I guessed right what your problem is. Or if I > missed your point, can you please describe in detail a case that > wouldn't work? You have understood it correctly. And what is the solution for bypassing one of the layers from component that should not affect the replay? > > And if we'll record only high-level asynchronous disk operations, then > > there will be a way to performs disk accesses without matching these > > operations with replay log. > > > > However, we already tried to record disk operations completion events > > (in previous more complicated version). Recording overhead was the same > > as with new blk_aio_ version of the patch. > > Well, if your code really intentionally ignores synchronous requests, > then the advantage you get from my approach is being correct. > > Other than that, I'm personally mainly interested in keeping the core > block layer easily maintainable and the hot I/O paths clean for cases > that don't use all the features qemu has. This means that in the mid > term existing features need to move out of the core block layer rather > than new ones being added. So doing things in a separate block driver > might not be an advantage for your immediate needs, but it's important > for clean modularisation and a requirement for things to get merged > (unless you can prove that it's impossible to do this way, but I don't > think you can - otherwise the whole block layer design would be wrong). Ok, let's try to find the solution with a new driver. Pavel Dovgalyuk
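Filled out a little, the driver sketch quoted above might look like the following. This is only a sketch under several assumptions: the req_replayed_list_*/req_completed_list_* helpers and blkreplay_log_request() are placeholders the real driver would still have to provide, the replay_mode/REPLAY_MODE_* symbols are taken from the record/replay series, and the bs->file->bs dereference assumes the BdrvChild-based bs->file of that period.

    /* Sketch only: the helpers marked as placeholders do not exist yet. */
    typedef struct BlockReplayState {
        uint64_t reqid;                 /* per-device request counter */
    } BlockReplayState;

    static int coroutine_fn blkreplay_co_readv(BlockDriverState *bs,
                                               int64_t sector_num,
                                               int nb_sectors,
                                               QEMUIOVector *qiov)
    {
        BlockReplayState *s = bs->opaque;
        uint64_t reqid = s->reqid++;
        int ret = bdrv_co_readv(bs->file->bs, sector_num, nb_sectors, qiov);

        if (replay_mode == REPLAY_MODE_RECORD) {
            blkreplay_log_request(reqid);               /* placeholder */
        } else if (replay_mode == REPLAY_MODE_PLAY) {
            bool *done = req_replayed_list_get(reqid);  /* placeholder */
            if (done) {
                /* blkreplay_run_event() is already waiting for us */
                *done = true;
            } else {
                /* park the request until the log allows it to complete */
                req_completed_list_insert(reqid, qemu_coroutine_self());
                qemu_coroutine_yield();
            }
        }
        return ret;
    }

The write path would be the same apart from calling bdrv_co_writev() on the lower layer, which is why the part after the I/O call can probably be shared, as noted in the quoted sketch.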
Am 11.02.2016 um 12:00 hat Pavel Dovgalyuk geschrieben: > > From: Kevin Wolf [mailto:kwolf@redhat.com] > > Am 11.02.2016 um 07:05 hat Pavel Dovgalyuk geschrieben: > > > > From: Kevin Wolf [mailto:kwolf@redhat.com] > > > > Am 10.02.2016 um 13:51 hat Pavel Dovgalyuk geschrieben: > > > > > However, I don't understand yet which layer do you offer as the candidate > > > > > for record/replay? What functions should be changed? > > > > > I would like to investigate this way, but I don't got it yet. > > > > > > > > At the core, I wouldn't change any existing function, but introduce a > > > > new block driver. You could copy raw_bsd.c for a start and then tweak > > > > it. Leave out functions that you don't want to support, and add the > > > > necessary magic to .bdrv_co_readv/writev. > > > > > > > > Something like this (can probably be generalised for more than just > > > > reads as the part after the bdrv_co_reads() call should be the same for > > > > reads, writes and any other request types): > > > > > > > > int blkreplay_co_readv() > > > > { > > > > BlockReplayState *s = bs->opaque; > > > > int reqid = s->reqid++; > > > > > > > > bdrv_co_readv(bs->file, ...); > > > > > > > > if (mode == record) { > > > > log(reqid, time); > > > > } else { > > > > assert(mode == replay); > > > > bool *done = req_replayed_list_get(reqid) > > > > if (done) { > > > > *done = true; > > > > } else { > > > > req_completed_list_insert(reqid, qemu_coroutine_self()); > > > > qemu_coroutine_yield(); > > > > } > > > > } > > > > } > > > > > > > > /* called by replay.c */ > > > > int blkreplay_run_event() > > > > { > > > > if (mode == replay) { > > > > co = req_completed_list_get(e.reqid); > > > > if (co) { > > > > qemu_coroutine_enter(co); > > > > } else { > > > > bool done = false; > > > > req_replayed_list_insert(reqid, &done); > > > > /* wait synchronously for completion */ > > > > while (!done) { > > > > aio_poll(); > > > > } > > > > } > > > > } > > > > } > > > > > > > > Where we could consider changing existing code is that it might be > > > > desirable to automatically put an instance of this block driver on top > > > > of every block device when record/replay is used. If we don't do that, > > > > you need to explicitly specify -drive driver=blkreplay,... > > > > > > As far, as I understand, all synchronous read/write request are also passed > > > through this coroutines layer. > > > > Yes, all read/write requests go through the same function internally, no > > matter which external interface was used. > > > > > It means that every disk access in replay phase should match the recording phase. > > > > Right. If I'm not mistaken, this was the fundamental requirement you > > have, so I wouldn't have suggested this otherwise. > > > > > Record/replay is intended to be used for debugging and analysis. > > > When execution is replayed, guest machine cannot notice analysis overhead. > > > Some analysis methods may include disk image reading. E.g., qemu-based > > > analysis framework DECAF uses sleuthkit for disk forensics ( > > https://github.com/sycurelab/DECAF ). > > > If similar framework will be used with replay, forensics disk access operations > > > won't work if we will record/replay the coroutines. > > > > Sorry, I'm not sure if I can follow. > > > > If such analysis software runs in the guest, it's not a replay any more > > and I completely fail to see what you're doing. 
> > > > If it's a qemu component independent from the guest, then my method > > gives you a clean way to bypass the replay driver that wouldn't be > > possible with yours. > > The second one. qemu may be extended with some components that > perform guest introspection. > > > If your plan was to record/replay only async requests and then use sync > > requests to bypass the record/replay, let me clearly state that this is > > the wrong approach: There are still guest devices which do synchronous > > I/O and need to be considered in the replay log, and you shouldn't > > prevent the analysis code from using AIO (in fact, using sync I/O in new > > code is very much frowned upon). > > Why do guest synchronous requests have to be recorded? > Aren't they completely deterministic? Good point. I think you're right in practice. In theory, with dataplane (i.e. when running the request in a separate thread) it could happen, but I guess that isn't very compatible with replay anyway - and at the first sight I couldn't see it performing synchronous requests either. > > I can explain in more detail what the block device structure looks like > > and how to access an image with and without record/replay, but first let > > me please know whether I guessed right what your problem is. Or if I > > missed your point, can you please describe in detail a case that > > wouldn't work? > > You have understood it correctly. > And what is the solution for bypassing one of the layers from component that > should not affect the replay? For this, you need to understand how block drivers are stacked in qemu. Each driver in the stack has a separate struct BlockDriverState, which can be used to access its data. You could hook up things like this: virtio-blk NBD server -------------- ------------ | | v | +------------+ | | blkreplay | | +------------+ | | | v | +------------+ | | qcow2 | <---------------+ +------------+ | v +------------+ | raw-posix | +------------+ | v -------------- filesystem As you see, what I've chosen for the external analysis interface is just an NBD server as this is the component that we already have today. You could hook up any other (new) code there; the important part is that it doesn't work on the BDS of the blkreplay driver, but directly on the BDS of the qcow2 driver. On the command line, it could look like this (this assumes that we don't add syntactic sugar that creates the blkreplay part automatically - we can always do that): -drive file=image.qcow2,if=none,id=img-direct -drive driver=blkreplay,if=none,image=img-direct,id=img-blkreplay -device virtio-blk-pci,drive=img-blkreplay The NBD server can't be started on the command line, so you'd go to the monitor and start it there with the direct access: (qemu) nbd_server_start unix:/tmp/my_socket (qemu) nbd_server_add img-direct (Exact syntax is untested, but this should roughly be how it works.) Kevin
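For an analysis component that lives inside the qemu process rather than behind the NBD server, the same bypass can be expressed in C by looking up the direct backend by its id. A minimal sketch, assuming the blk_by_name() and sector-based blk_read() helpers as they existed at the time (a real component would rather use the AIO interface, as noted earlier in the thread):

    /* Hypothetical qemu-internal analysis hook: read guest sectors through
     * the "img-direct" backend defined on the command line above, i.e.
     * below the blkreplay layer. */
    static int analysis_read_sectors(int64_t sector_num, int nb_sectors,
                                     uint8_t *buf)
    {
        BlockBackend *blk = blk_by_name("img-direct");

        if (!blk) {
            return -ENOENT;
        }
        /* Synchronous only for brevity; a real component would use
         * blk_aio_readv() instead. */
        return blk_read(blk, sector_num, buf, nb_sectors);
    }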
> From: Kevin Wolf [mailto:kwolf@redhat.com] > Am 11.02.2016 um 12:00 hat Pavel Dovgalyuk geschrieben: > > > From: Kevin Wolf [mailto:kwolf@redhat.com] > > > Am 11.02.2016 um 07:05 hat Pavel Dovgalyuk geschrieben: > > > > > From: Kevin Wolf [mailto:kwolf@redhat.com] > > > > > Am 10.02.2016 um 13:51 hat Pavel Dovgalyuk geschrieben: > > > > > > However, I don't understand yet which layer do you offer as the candidate > > > > > > for record/replay? What functions should be changed? > > > > > > I would like to investigate this way, but I don't got it yet. > > > > > > > > > > At the core, I wouldn't change any existing function, but introduce a > > > > > new block driver. You could copy raw_bsd.c for a start and then tweak > > > > > it. Leave out functions that you don't want to support, and add the > > > > > necessary magic to .bdrv_co_readv/writev. > > > > > > > > > > Something like this (can probably be generalised for more than just > > > > > reads as the part after the bdrv_co_reads() call should be the same for > > > > > reads, writes and any other request types): > > > > > > > > > > int blkreplay_co_readv() > > > > > { > > > > > BlockReplayState *s = bs->opaque; > > > > > int reqid = s->reqid++; > > > > > > > > > > bdrv_co_readv(bs->file, ...); > > > > > > > > > > if (mode == record) { > > > > > log(reqid, time); > > > > > } else { > > > > > assert(mode == replay); > > > > > bool *done = req_replayed_list_get(reqid) > > > > > if (done) { > > > > > *done = true; > > > > > } else { > > > > > req_completed_list_insert(reqid, qemu_coroutine_self()); > > > > > qemu_coroutine_yield(); > > > > > } > > > > > } > > > > > } > > > > > > > > > > /* called by replay.c */ > > > > > int blkreplay_run_event() > > > > > { > > > > > if (mode == replay) { > > > > > co = req_completed_list_get(e.reqid); > > > > > if (co) { > > > > > qemu_coroutine_enter(co); > > > > > } else { > > > > > bool done = false; > > > > > req_replayed_list_insert(reqid, &done); > > > > > /* wait synchronously for completion */ > > > > > while (!done) { > > > > > aio_poll(); > > > > > } > > > > > } > > > > > } > > > > > } > > > > > > > > > > Where we could consider changing existing code is that it might be > > > > > desirable to automatically put an instance of this block driver on top > > > > > of every block device when record/replay is used. If we don't do that, > > > > > you need to explicitly specify -drive driver=blkreplay,... > > > > > > > > As far, as I understand, all synchronous read/write request are also passed > > > > through this coroutines layer. > > > > > > Yes, all read/write requests go through the same function internally, no > > > matter which external interface was used. > > > > > > > It means that every disk access in replay phase should match the recording phase. > > > > > > Right. If I'm not mistaken, this was the fundamental requirement you > > > have, so I wouldn't have suggested this otherwise. > > > > > > > Record/replay is intended to be used for debugging and analysis. > > > > When execution is replayed, guest machine cannot notice analysis overhead. > > > > Some analysis methods may include disk image reading. E.g., qemu-based > > > > analysis framework DECAF uses sleuthkit for disk forensics ( > > > https://github.com/sycurelab/DECAF ). > > > > If similar framework will be used with replay, forensics disk access operations > > > > won't work if we will record/replay the coroutines. > > > > > > Sorry, I'm not sure if I can follow. 
> > > > > > If such analysis software runs in the guest, it's not a replay any more > > > and I completely fail to see what you're doing. > > > > > > If it's a qemu component independent from the guest, then my method > > > gives you a clean way to bypass the replay driver that wouldn't be > > > possible with yours. > > > > The second one. qemu may be extended with some components that > > perform guest introspection. > > > > > If your plan was to record/replay only async requests and then use sync > > > requests to bypass the record/replay, let me clearly state that this is > > > the wrong approach: There are still guest devices which do synchronous > > > I/O and need to be considered in the replay log, and you shouldn't > > > prevent the analysis code from using AIO (in fact, using sync I/O in new > > > code is very much frowned upon). > > > > Why do guest synchronous requests have to be recorded? > > Aren't they completely deterministic? > > Good point. I think you're right in practice. In theory, with dataplane > (i.e. when running the request in a separate thread) it could happen, > but I guess that isn't very compatible with replay anyway - and at the > first sight I couldn't see it performing synchronous requests either. > > > > I can explain in more detail what the block device structure looks like > > > and how to access an image with and without record/replay, but first let > > > me please know whether I guessed right what your problem is. Or if I > > > missed your point, can you please describe in detail a case that > > > wouldn't work? > > > > You have understood it correctly. > > And what is the solution for bypassing one of the layers from component that > > should not affect the replay? > > For this, you need to understand how block drivers are stacked in qemu. > Each driver in the stack has a separate struct BlockDriverState, which > can be used to access its data. You could hook up things like this: > > virtio-blk NBD server > -------------- ------------ > | | > v | > +------------+ | > | blkreplay | | > +------------+ | > | | > v | > +------------+ | > | qcow2 | <---------------+ > +------------+ > | > v > +------------+ > | raw-posix | > +------------+ > | > v > -------------- > filesystem > > As you see, what I've chosen for the external analysis interface is just > an NBD server as this is the component that we already have today. You > could hook up any other (new) code there; the important part is that it > doesn't work on the BDS of the blkreplay driver, but directly on the BDS > of the qcow2 driver. > > On the command line, it could look like this (this assumes that we don't > add syntactic sugar that creates the blkreplay part automatically - we > can always do that): > > -drive file=image.qcow2,if=none,id=img-direct > -drive driver=blkreplay,if=none,image=img-direct,id=img-blkreplay > -device virtio-blk-pci,drive=img-blkreplay > > The NBD server can't be started on the command line, so you'd go to the > monitor and start it there with the direct access: > > (qemu) nbd_server_start unix:/tmp/my_socket > (qemu) nbd_server_add img-direct > > (Exact syntax is untested, but this should roughly be how it works.) Thank you! I'll try this approach and come back either with patches or new questions. Pavel Dovgalyuk
> From: Kevin Wolf [mailto:kwolf@redhat.com] > Am 11.02.2016 um 12:00 hat Pavel Dovgalyuk geschrieben: > > > From: Kevin Wolf [mailto:kwolf@redhat.com] > > > Am 11.02.2016 um 07:05 hat Pavel Dovgalyuk geschrieben: > > > > > From: Kevin Wolf [mailto:kwolf@redhat.com] > > > > > Am 10.02.2016 um 13:51 hat Pavel Dovgalyuk geschrieben: > > > > > > However, I don't understand yet which layer do you offer as the candidate > > > > > > for record/replay? What functions should be changed? > > > > > > I would like to investigate this way, but I don't got it yet. > > > > > > > > > > At the core, I wouldn't change any existing function, but introduce a > > > > > new block driver. You could copy raw_bsd.c for a start and then tweak > > > > > it. Leave out functions that you don't want to support, and add the > > > > > necessary magic to .bdrv_co_readv/writev. > > > > > > > > > > Something like this (can probably be generalised for more than just > > > > > reads as the part after the bdrv_co_reads() call should be the same for > > > > > reads, writes and any other request types): > > > > > > > > > > int blkreplay_co_readv() > > > > > { > > > > > BlockReplayState *s = bs->opaque; > > > > > int reqid = s->reqid++; > > > > > > > > > > bdrv_co_readv(bs->file, ...); > > > > > > > > > > if (mode == record) { > > > > > log(reqid, time); > > > > > } else { > > > > > assert(mode == replay); > > > > > bool *done = req_replayed_list_get(reqid) > > > > > if (done) { > > > > > *done = true; > > > > > } else { > > > > > req_completed_list_insert(reqid, qemu_coroutine_self()); > > > > > qemu_coroutine_yield(); > > > > > } > > > > > } > > > > > } > > > > > > > > > > /* called by replay.c */ > > > > > int blkreplay_run_event() > > > > > { > > > > > if (mode == replay) { > > > > > co = req_completed_list_get(e.reqid); > > > > > if (co) { > > > > > qemu_coroutine_enter(co); > > > > > } else { > > > > > bool done = false; > > > > > req_replayed_list_insert(reqid, &done); > > > > > /* wait synchronously for completion */ > > > > > while (!done) { > > > > > aio_poll(); > > > > > } > > > > > } > > > > > } > > > > > } > > > > > > > > > > Where we could consider changing existing code is that it might be > > > > > desirable to automatically put an instance of this block driver on top > > > > > of every block device when record/replay is used. If we don't do that, > > > > > you need to explicitly specify -drive driver=blkreplay,... > > > > > > > > As far, as I understand, all synchronous read/write request are also passed > > > > through this coroutines layer. > > > > > > Yes, all read/write requests go through the same function internally, no > > > matter which external interface was used. > > > > > > > It means that every disk access in replay phase should match the recording phase. > > > > > > Right. If I'm not mistaken, this was the fundamental requirement you > > > have, so I wouldn't have suggested this otherwise. > > > > > > > Record/replay is intended to be used for debugging and analysis. > > > > When execution is replayed, guest machine cannot notice analysis overhead. > > > > Some analysis methods may include disk image reading. E.g., qemu-based > > > > analysis framework DECAF uses sleuthkit for disk forensics ( > > > https://github.com/sycurelab/DECAF ). > > > > If similar framework will be used with replay, forensics disk access operations > > > > won't work if we will record/replay the coroutines. > > > > > > Sorry, I'm not sure if I can follow. 
> > > > > > If such analysis software runs in the guest, it's not a replay any more > > > and I completely fail to see what you're doing. > > > > > > If it's a qemu component independent from the guest, then my method > > > gives you a clean way to bypass the replay driver that wouldn't be > > > possible with yours. > > > > The second one. qemu may be extended with some components that > > perform guest introspection. > > > > > If your plan was to record/replay only async requests and then use sync > > > requests to bypass the record/replay, let me clearly state that this is > > > the wrong approach: There are still guest devices which do synchronous > > > I/O and need to be considered in the replay log, and you shouldn't > > > prevent the analysis code from using AIO (in fact, using sync I/O in new > > > code is very much frowned upon). > > > > Why do guest synchronous requests have to be recorded? > > Aren't they completely deterministic? > > Good point. I think you're right in practice. In theory, with dataplane > (i.e. when running the request in a separate thread) it could happen, > but I guess that isn't very compatible with replay anyway - and at the > first sight I couldn't see it performing synchronous requests either. > > > > I can explain in more detail what the block device structure looks like > > > and how to access an image with and without record/replay, but first let > > > me please know whether I guessed right what your problem is. Or if I > > > missed your point, can you please describe in detail a case that > > > wouldn't work? > > > > You have understood it correctly. > > And what is the solution for bypassing one of the layers from component that > > should not affect the replay? > > For this, you need to understand how block drivers are stacked in qemu. > Each driver in the stack has a separate struct BlockDriverState, which > can be used to access its data. You could hook up things like this: > > virtio-blk NBD server > -------------- ------------ > | | > v | > +------------+ | > | blkreplay | | > +------------+ | > | | > v | > +------------+ | > | qcow2 | <---------------+ > +------------+ > | > v > +------------+ > | raw-posix | > +------------+ > | > v > -------------- > filesystem > > As you see, what I've chosen for the external analysis interface is just > an NBD server as this is the component that we already have today. You > could hook up any other (new) code there; the important part is that it > doesn't work on the BDS of the blkreplay driver, but directly on the BDS > of the qcow2 driver. > > On the command line, it could look like this (this assumes that we don't > add syntactic sugar that creates the blkreplay part automatically - we > can always do that): > > -drive file=image.qcow2,if=none,id=img-direct > -drive driver=blkreplay,if=none,image=img-direct,id=img-blkreplay > -device virtio-blk-pci,drive=img-blkreplay Are there any hints for driver with these options? I can't figure out how to create _open function for that. blkdebug driver seems similar, but it receives image name directly, without referencing the lower level. Pavel Dovgalyuk
Am 12.02.2016 um 09:33 hat Pavel Dovgalyuk geschrieben: > > From: Kevin Wolf [mailto:kwolf@redhat.com] > > Am 11.02.2016 um 12:00 hat Pavel Dovgalyuk geschrieben: > > > > From: Kevin Wolf [mailto:kwolf@redhat.com] > > > > Am 11.02.2016 um 07:05 hat Pavel Dovgalyuk geschrieben: > > > > > > From: Kevin Wolf [mailto:kwolf@redhat.com] > > > > > > Am 10.02.2016 um 13:51 hat Pavel Dovgalyuk geschrieben: > > > > > > > However, I don't understand yet which layer do you offer as the candidate > > > > > > > for record/replay? What functions should be changed? > > > > > > > I would like to investigate this way, but I don't got it yet. > > > > > > > > > > > > At the core, I wouldn't change any existing function, but introduce a > > > > > > new block driver. You could copy raw_bsd.c for a start and then tweak > > > > > > it. Leave out functions that you don't want to support, and add the > > > > > > necessary magic to .bdrv_co_readv/writev. > > > > > > > > > > > > Something like this (can probably be generalised for more than just > > > > > > reads as the part after the bdrv_co_reads() call should be the same for > > > > > > reads, writes and any other request types): > > > > > > > > > > > > int blkreplay_co_readv() > > > > > > { > > > > > > BlockReplayState *s = bs->opaque; > > > > > > int reqid = s->reqid++; > > > > > > > > > > > > bdrv_co_readv(bs->file, ...); > > > > > > > > > > > > if (mode == record) { > > > > > > log(reqid, time); > > > > > > } else { > > > > > > assert(mode == replay); > > > > > > bool *done = req_replayed_list_get(reqid) > > > > > > if (done) { > > > > > > *done = true; > > > > > > } else { > > > > > > req_completed_list_insert(reqid, qemu_coroutine_self()); > > > > > > qemu_coroutine_yield(); > > > > > > } > > > > > > } > > > > > > } > > > > > > > > > > > > /* called by replay.c */ > > > > > > int blkreplay_run_event() > > > > > > { > > > > > > if (mode == replay) { > > > > > > co = req_completed_list_get(e.reqid); > > > > > > if (co) { > > > > > > qemu_coroutine_enter(co); > > > > > > } else { > > > > > > bool done = false; > > > > > > req_replayed_list_insert(reqid, &done); > > > > > > /* wait synchronously for completion */ > > > > > > while (!done) { > > > > > > aio_poll(); > > > > > > } > > > > > > } > > > > > > } > > > > > > } > > > > > > > > > > > > Where we could consider changing existing code is that it might be > > > > > > desirable to automatically put an instance of this block driver on top > > > > > > of every block device when record/replay is used. If we don't do that, > > > > > > you need to explicitly specify -drive driver=blkreplay,... > > > > > > > > > > As far, as I understand, all synchronous read/write request are also passed > > > > > through this coroutines layer. > > > > > > > > Yes, all read/write requests go through the same function internally, no > > > > matter which external interface was used. > > > > > > > > > It means that every disk access in replay phase should match the recording phase. > > > > > > > > Right. If I'm not mistaken, this was the fundamental requirement you > > > > have, so I wouldn't have suggested this otherwise. > > > > > > > > > Record/replay is intended to be used for debugging and analysis. > > > > > When execution is replayed, guest machine cannot notice analysis overhead. > > > > > Some analysis methods may include disk image reading. E.g., qemu-based > > > > > analysis framework DECAF uses sleuthkit for disk forensics ( > > > > https://github.com/sycurelab/DECAF ). 
> > > > > If similar framework will be used with replay, forensics disk access operations > > > > > won't work if we will record/replay the coroutines. > > > > > > > > Sorry, I'm not sure if I can follow. > > > > > > > > If such analysis software runs in the guest, it's not a replay any more > > > > and I completely fail to see what you're doing. > > > > > > > > If it's a qemu component independent from the guest, then my method > > > > gives you a clean way to bypass the replay driver that wouldn't be > > > > possible with yours. > > > > > > The second one. qemu may be extended with some components that > > > perform guest introspection. > > > > > > > If your plan was to record/replay only async requests and then use sync > > > > requests to bypass the record/replay, let me clearly state that this is > > > > the wrong approach: There are still guest devices which do synchronous > > > > I/O and need to be considered in the replay log, and you shouldn't > > > > prevent the analysis code from using AIO (in fact, using sync I/O in new > > > > code is very much frowned upon). > > > > > > Why do guest synchronous requests have to be recorded? > > > Aren't they completely deterministic? > > > > Good point. I think you're right in practice. In theory, with dataplane > > (i.e. when running the request in a separate thread) it could happen, > > but I guess that isn't very compatible with replay anyway - and at the > > first sight I couldn't see it performing synchronous requests either. > > > > > > I can explain in more detail what the block device structure looks like > > > > and how to access an image with and without record/replay, but first let > > > > me please know whether I guessed right what your problem is. Or if I > > > > missed your point, can you please describe in detail a case that > > > > wouldn't work? > > > > > > You have understood it correctly. > > > And what is the solution for bypassing one of the layers from component that > > > should not affect the replay? > > > > For this, you need to understand how block drivers are stacked in qemu. > > Each driver in the stack has a separate struct BlockDriverState, which > > can be used to access its data. You could hook up things like this: > > > > virtio-blk NBD server > > -------------- ------------ > > | | > > v | > > +------------+ | > > | blkreplay | | > > +------------+ | > > | | > > v | > > +------------+ | > > | qcow2 | <---------------+ > > +------------+ > > | > > v > > +------------+ > > | raw-posix | > > +------------+ > > | > > v > > -------------- > > filesystem > > > > As you see, what I've chosen for the external analysis interface is just > > an NBD server as this is the component that we already have today. You > > could hook up any other (new) code there; the important part is that it > > doesn't work on the BDS of the blkreplay driver, but directly on the BDS > > of the qcow2 driver. > > > > On the command line, it could look like this (this assumes that we don't > > add syntactic sugar that creates the blkreplay part automatically - we > > can always do that): > > > > -drive file=image.qcow2,if=none,id=img-direct > > -drive driver=blkreplay,if=none,image=img-direct,id=img-blkreplay > > -device virtio-blk-pci,drive=img-blkreplay > > Are there any hints for driver with these options? > I can't figure out how to create _open function for that. > blkdebug driver seems similar, but it receives image name directly, without referencing > the lower level. Actually, the 'image' option of blkdebug works like this. 
For example, I just tried this command line and it works:

-drive file=/tmp/test.qcow2,if=none,id=img
-drive driver=blkdebug,image=img,if=virtio

Just ignore the legacy part with 'x-image' and pass NULL as the filename to bdrv_open_child(). Another driver that could be useful as an example is quorum.

Kevin
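Following that advice, a minimal sketch of the open function and driver registration could look like this, assuming the bdrv_open_child()/child_file interface of that period and the BlockReplayState from the earlier sketch; error handling is trimmed:

    static int blkreplay_open(BlockDriverState *bs, QDict *options, int flags,
                              Error **errp)
    {
        Error *local_err = NULL;

        /* Open the lower layer given by the "image" option; the filename
         * argument is NULL, as suggested above, so only the option-based
         * reference is used. */
        bs->file = bdrv_open_child(NULL, options, "image", bs, &child_file,
                                   false, &local_err);
        if (local_err) {
            error_propagate(errp, local_err);
            return -EINVAL;
        }

        return 0;
    }

    static BlockDriver bdrv_blkreplay = {
        .format_name        = "blkreplay",
        .protocol_name      = "blkreplay",
        .instance_size      = sizeof(BlockReplayState),

        .bdrv_file_open     = blkreplay_open,
        .bdrv_co_readv      = blkreplay_co_readv,
        /* writev/flush wrappers would follow the same pattern */
    };

    static void bdrv_blkreplay_init(void)
    {
        bdrv_register(&bdrv_blkreplay);
    }

    block_init(bdrv_blkreplay_init);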
> From: Kevin Wolf [mailto:kwolf@redhat.com] > Am 10.02.2016 um 13:51 hat Pavel Dovgalyuk geschrieben: > > > From: Kevin Wolf [mailto:kwolf@redhat.com] > > > Am 10.02.2016 um 13:05 hat Pavel Dovgalyuk geschrieben: > > > > > Am 09.02.2016 um 12:52 hat Pavel Dovgalyuk geschrieben: > > > > > > > From: Kevin Wolf [mailto:kwolf@redhat.com] > > > > > > > But even this doesn't feel completely right, because block drivers are > > > > > > > already layered and there is no need to hardcode something optional (and > > > > > > > rarely used) in the hot code path that could just be another layer. > > > > > > > > > > > > > > I assume that you know beforehand if you want to replay something, so > > > > > > > requiring you to configure your block devices with a replay driver on > > > > > > > top of the stack seems reasonable enough. > > > > > > > > > > > > I cannot use block drivers for this. When driver functions are used, QEMU > > > > > > is already used coroutines (and probably started bottom halves). > > > > > > Coroutines make execution non-deterministic. > > > > > > That's why we have to intercept blk_aio_ functions, that are called > > > > > > deterministically. > > > > > > > > > > What does "deterministic" mean in this context, i.e. what are your exact > > > > > requirements? > > > > > > > > "Deterministic" means that the replayed execution should run exactly > > > > the same guest instructions in the same sequence, as in recording session. > > > > > > Okay. I think with this we can do better than what you have now. > > > > > > > > I don't think that coroutines introduce anything non-deterministic per > > > > > se. Depending on what you mean by it, the block layer code paths in > > > > > block.c may contain problematic code. > > > > > > > > They are non-deterministic if we need instruction-level accuracy. > > > > Thread switching (and therefore callbacks and BH execution) is non-deterministic. > > > > > > Thread switching depends on an external event (the kernel scheduler > > > deciding to switch), so agreed, if a thread switch ever influences what > > > the guest sees, that would be a problem. > > > > > > Generally, however, callbacks and BHs don't involve a thread switch at > > > all (BHs can be invoked from a different thread in theory, but we have > > > very few of those cases and they shouldn't be visible for the guest). > > > The same is true for coroutines, which are semantically equivalent to > > > callbacks. > > > > > > > In two different executions these callbacks may happen at different moments of > > > > time (counting in number of executed instructions). > > > > All operations with virtual devices (including memory, interrupt controller, > > > > and disk drive controller) should happen at deterministic moments of time > > > > to be replayable. > > > > > > Right, so let's talk about what this external non-deterministic event > > > really is. > > > > > > I think the only thing whose timing is unknown in the block layer is the > > > completion of I/O requests. This non-determinism comes from the time the > > > I/O syscalls made by the lowest layer (usually raw-posix) take. > > > > Right. > > > > > This means that we can add logic to remove the non-determinism at the > > > point of our choice between raw-posix and the guest device emulation. A > > > block driver on top is as good as anything else. 
> > > > > > While recording, this block driver would just pass the request to next > > > lower layer (starting a request is deterministic, so it doesn't need to > > > be logged) and once the request completes it logs it. While replaying, > > > the completion of requests is delayed until we read it in the log; if we > > > read it in the log and the request hasn't completed yet, we do a busy > > > wait for it (while(!completed) aio_poll();). > > > > I tried serializing all bottom halves and worker thread callbacks in > > previous version of the patches. That code was much more complicated > > and error-prone than the current version. We had to classify all bottom > > halves to recorded and non-recorded (because sometimes they are used > > for qemu's purposes, not the guest ones). > > > > However, I don't understand yet which layer do you offer as the candidate > > for record/replay? What functions should be changed? > > I would like to investigate this way, but I don't got it yet. > > At the core, I wouldn't change any existing function, but introduce a > new block driver. You could copy raw_bsd.c for a start and then tweak > it. Leave out functions that you don't want to support, and add the > necessary magic to .bdrv_co_readv/writev. > > Something like this (can probably be generalised for more than just > reads as the part after the bdrv_co_reads() call should be the same for > reads, writes and any other request types): > > int blkreplay_co_readv() > { > BlockReplayState *s = bs->opaque; > int reqid = s->reqid++; > > bdrv_co_readv(bs->file, ...); > > if (mode == record) { > log(reqid, time); > } else { > assert(mode == replay); > bool *done = req_replayed_list_get(reqid) > if (done) { > *done = true; > } else { point A > req_completed_list_insert(reqid, qemu_coroutine_self()); > qemu_coroutine_yield(); > } > } > } > > /* called by replay.c */ > int blkreplay_run_event() > { > if (mode == replay) { > co = req_completed_list_get(e.reqid); > if (co) { > qemu_coroutine_enter(co); > } else { > bool done = false; > req_replayed_list_insert(reqid, &done); point B > /* wait synchronously for completion */ > while (!done) { > aio_poll(); > } > } > } > } One more question about coroutines. Are race conditions possible in this sample? In replay mode we may call readv, and reach point A. On the same time, we will read point B in another thread. Then readv will yield and nobody will start it back? Pavel Dovgalyuk
Am 12.02.2016 um 14:19 hat Pavel Dovgalyuk geschrieben: > > From: Kevin Wolf [mailto:kwolf@redhat.com] > > Am 10.02.2016 um 13:51 hat Pavel Dovgalyuk geschrieben: > > > > From: Kevin Wolf [mailto:kwolf@redhat.com] > > > > Am 10.02.2016 um 13:05 hat Pavel Dovgalyuk geschrieben: > > > > > > Am 09.02.2016 um 12:52 hat Pavel Dovgalyuk geschrieben: > > > > > > > > From: Kevin Wolf [mailto:kwolf@redhat.com] > > > > > > > > But even this doesn't feel completely right, because block drivers are > > > > > > > > already layered and there is no need to hardcode something optional (and > > > > > > > > rarely used) in the hot code path that could just be another layer. > > > > > > > > > > > > > > > > I assume that you know beforehand if you want to replay something, so > > > > > > > > requiring you to configure your block devices with a replay driver on > > > > > > > > top of the stack seems reasonable enough. > > > > > > > > > > > > > > I cannot use block drivers for this. When driver functions are used, QEMU > > > > > > > is already used coroutines (and probably started bottom halves). > > > > > > > Coroutines make execution non-deterministic. > > > > > > > That's why we have to intercept blk_aio_ functions, that are called > > > > > > > deterministically. > > > > > > > > > > > > What does "deterministic" mean in this context, i.e. what are your exact > > > > > > requirements? > > > > > > > > > > "Deterministic" means that the replayed execution should run exactly > > > > > the same guest instructions in the same sequence, as in recording session. > > > > > > > > Okay. I think with this we can do better than what you have now. > > > > > > > > > > I don't think that coroutines introduce anything non-deterministic per > > > > > > se. Depending on what you mean by it, the block layer code paths in > > > > > > block.c may contain problematic code. > > > > > > > > > > They are non-deterministic if we need instruction-level accuracy. > > > > > Thread switching (and therefore callbacks and BH execution) is non-deterministic. > > > > > > > > Thread switching depends on an external event (the kernel scheduler > > > > deciding to switch), so agreed, if a thread switch ever influences what > > > > the guest sees, that would be a problem. > > > > > > > > Generally, however, callbacks and BHs don't involve a thread switch at > > > > all (BHs can be invoked from a different thread in theory, but we have > > > > very few of those cases and they shouldn't be visible for the guest). > > > > The same is true for coroutines, which are semantically equivalent to > > > > callbacks. > > > > > > > > > In two different executions these callbacks may happen at different moments of > > > > > time (counting in number of executed instructions). > > > > > All operations with virtual devices (including memory, interrupt controller, > > > > > and disk drive controller) should happen at deterministic moments of time > > > > > to be replayable. > > > > > > > > Right, so let's talk about what this external non-deterministic event > > > > really is. > > > > > > > > I think the only thing whose timing is unknown in the block layer is the > > > > completion of I/O requests. This non-determinism comes from the time the > > > > I/O syscalls made by the lowest layer (usually raw-posix) take. > > > > > > Right. > > > > > > > This means that we can add logic to remove the non-determinism at the > > > > point of our choice between raw-posix and the guest device emulation. A > > > > block driver on top is as good as anything else. 
> > > > > > > > While recording, this block driver would just pass the request to next > > > > lower layer (starting a request is deterministic, so it doesn't need to > > > > be logged) and once the request completes it logs it. While replaying, > > > > the completion of requests is delayed until we read it in the log; if we > > > > read it in the log and the request hasn't completed yet, we do a busy > > > > wait for it (while(!completed) aio_poll();). > > > > > > I tried serializing all bottom halves and worker thread callbacks in > > > previous version of the patches. That code was much more complicated > > > and error-prone than the current version. We had to classify all bottom > > > halves to recorded and non-recorded (because sometimes they are used > > > for qemu's purposes, not the guest ones). > > > > > > However, I don't understand yet which layer do you offer as the candidate > > > for record/replay? What functions should be changed? > > > I would like to investigate this way, but I don't got it yet. > > > > At the core, I wouldn't change any existing function, but introduce a > > new block driver. You could copy raw_bsd.c for a start and then tweak > > it. Leave out functions that you don't want to support, and add the > > necessary magic to .bdrv_co_readv/writev. > > > > Something like this (can probably be generalised for more than just > > reads as the part after the bdrv_co_reads() call should be the same for > > reads, writes and any other request types): > > > > int blkreplay_co_readv() > > { > > BlockReplayState *s = bs->opaque; > > int reqid = s->reqid++; > > > > bdrv_co_readv(bs->file, ...); > > > > if (mode == record) { > > log(reqid, time); > > } else { > > assert(mode == replay); > > bool *done = req_replayed_list_get(reqid) > > if (done) { > > *done = true; > > } else { > point A > > req_completed_list_insert(reqid, qemu_coroutine_self()); > > qemu_coroutine_yield(); > > } > > } > > } > > > > /* called by replay.c */ > > int blkreplay_run_event() > > { > > if (mode == replay) { > > co = req_completed_list_get(e.reqid); > > if (co) { > > qemu_coroutine_enter(co); > > } else { > > bool done = false; > > req_replayed_list_insert(reqid, &done); > point B > > /* wait synchronously for completion */ > > while (!done) { > > aio_poll(); > > } > > } > > } > > } > > One more question about coroutines. > Are race conditions possible in this sample? > In replay mode we may call readv, and reach point A. > On the same time, we will read point B in another thread. > Then readv will yield and nobody will start it back? There are two aspects to this: * Real multithreading doesn't exist in the block layer. All block driver functions are only called with the mutex in the AioContext held. There is exactly one AioContext per BDS, so no two threads can possible be operating on the same BDS at the same time. * Coroutines are different from threads in that they aren't preemptive. They are only interrupted in places where they explicitly yield. Of course, in order for this to work, we actually need to take the mutex before calling blkreplay_run_event(), which is called directly from the replay code (which runs in the mainloop thread? Or vcpu?). So I think you need to have a aio_context_acquire(bs->aio_context) and aio_context_release(bs->aio_context) around the function; either here or in the calling replay code. Kevin
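A small sketch of the locking described above, assuming a hypothetical one-argument blkreplay_run_event() and that the replay code knows which BlockDriverState the event belongs to:

    /* Wrapper on the replay side: take the AioContext of the BDS before
     * entering or waiting for its coroutines, release it afterwards. */
    static void replay_run_block_event(BlockDriverState *bs, uint64_t reqid)
    {
        AioContext *ctx = bdrv_get_aio_context(bs);

        aio_context_acquire(ctx);
        blkreplay_run_event(reqid);     /* hypothetical signature */
        aio_context_release(ctx);
    }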
> From: Kevin Wolf [mailto:kwolf@redhat.com] > > > > > > int blkreplay_co_readv() > > > { > > > BlockReplayState *s = bs->opaque; > > > int reqid = s->reqid++; > > > > > > bdrv_co_readv(bs->file, ...); > > > > > > if (mode == record) { > > > log(reqid, time); > > > } else { > > > assert(mode == replay); > > > bool *done = req_replayed_list_get(reqid) > > > if (done) { > > > *done = true; > > > } else { > > point A > > > req_completed_list_insert(reqid, qemu_coroutine_self()); > > > qemu_coroutine_yield(); > > > } > > > } > > > } > > > > > > /* called by replay.c */ > > > int blkreplay_run_event() > > > { > > > if (mode == replay) { > > > co = req_completed_list_get(e.reqid); > > > if (co) { > > > qemu_coroutine_enter(co); > > > } else { > > > bool done = false; > > > req_replayed_list_insert(reqid, &done); > > point B > > > /* wait synchronously for completion */ > > > while (!done) { > > > aio_poll(); > > > } > > > } > > > } > > > } > > > > One more question about coroutines. > > Are race conditions possible in this sample? > > In replay mode we may call readv, and reach point A. > > On the same time, we will read point B in another thread. > > Then readv will yield and nobody will start it back? > > There are two aspects to this: > > * Real multithreading doesn't exist in the block layer. All block driver > functions are only called with the mutex in the AioContext held. There > is exactly one AioContext per BDS, so no two threads can possible be > operating on the same BDS at the same time. > > * Coroutines are different from threads in that they aren't preemptive. > They are only interrupted in places where they explicitly yield. > > Of course, in order for this to work, we actually need to take the mutex > before calling blkreplay_run_event(), which is called directly from the > replay code (which runs in the mainloop thread? Or vcpu?). blkreplay_run_event() is called from replay code which is protected by mutex. This function may be called from io and vcpu threads, because both of them have replay functions invocations. > So I think you need to have a aio_context_acquire(bs->aio_context) and > aio_context_release(bs->aio_context) around the function; either here or > in the calling replay code. And what about coroutine code? Does it call aio_context_acquire somewhere? Pavel Dovgalyuk
Am 15.02.2016 um 09:38 hat Pavel Dovgalyuk geschrieben: > > From: Kevin Wolf [mailto:kwolf@redhat.com] > > > > > > > > int blkreplay_co_readv() > > > > { > > > > BlockReplayState *s = bs->opaque; > > > > int reqid = s->reqid++; > > > > > > > > bdrv_co_readv(bs->file, ...); > > > > > > > > if (mode == record) { > > > > log(reqid, time); > > > > } else { > > > > assert(mode == replay); > > > > bool *done = req_replayed_list_get(reqid) > > > > if (done) { > > > > *done = true; > > > > } else { > > > point A > > > > req_completed_list_insert(reqid, qemu_coroutine_self()); > > > > qemu_coroutine_yield(); > > > > } > > > > } > > > > } > > > > > > > > /* called by replay.c */ > > > > int blkreplay_run_event() > > > > { > > > > if (mode == replay) { > > > > co = req_completed_list_get(e.reqid); > > > > if (co) { > > > > qemu_coroutine_enter(co); > > > > } else { > > > > bool done = false; > > > > req_replayed_list_insert(reqid, &done); > > > point B > > > > /* wait synchronously for completion */ > > > > while (!done) { > > > > aio_poll(); > > > > } > > > > } > > > > } > > > > } > > > > > > One more question about coroutines. > > > Are race conditions possible in this sample? > > > In replay mode we may call readv, and reach point A. > > > On the same time, we will read point B in another thread. > > > Then readv will yield and nobody will start it back? > > > > There are two aspects to this: > > > > * Real multithreading doesn't exist in the block layer. All block driver > > functions are only called with the mutex in the AioContext held. There > > is exactly one AioContext per BDS, so no two threads can possible be > > operating on the same BDS at the same time. > > > > * Coroutines are different from threads in that they aren't preemptive. > > They are only interrupted in places where they explicitly yield. > > > > Of course, in order for this to work, we actually need to take the mutex > > before calling blkreplay_run_event(), which is called directly from the > > replay code (which runs in the mainloop thread? Or vcpu?). > > blkreplay_run_event() is called from replay code which is protected by mutex. > This function may be called from io and vcpu threads, because both of them > have replay functions invocations. I assume that the mutex you mean is the BQL. This is enough for cases which use the main loop AioContext, but not for block devices with dataplane threads. > > So I think you need to have a aio_context_acquire(bs->aio_context) and > > aio_context_release(bs->aio_context) around the function; either here or > > in the calling replay code. > > And what about coroutine code? Does it call aio_context_acquire somewhere? The coroutine code doesn't call aio_context_acquire() by itself. But coroutines may only be entered when the appropriate AioContext lock is held, so they are always protected. Kevin
> From: Pavel Dovgalyuk [mailto:dovgaluk@ispras.ru] > > From: Kevin Wolf [mailto:kwolf@redhat.com] > > > > > > > > int blkreplay_co_readv() > > > > { > > > > BlockReplayState *s = bs->opaque; > > > > int reqid = s->reqid++; > > > > > > > > bdrv_co_readv(bs->file, ...); > > > > > > > > if (mode == record) { > > > > log(reqid, time); > > > > } else { > > > > assert(mode == replay); > > > > bool *done = req_replayed_list_get(reqid) > > > > if (done) { > > > > *done = true; > > > > } else { > > > point A > > > > req_completed_list_insert(reqid, qemu_coroutine_self()); > > > > qemu_coroutine_yield(); > > > > } > > > > } > > > > } > > > > > > > > /* called by replay.c */ > > > > int blkreplay_run_event() > > > > { > > > > if (mode == replay) { > > > > co = req_completed_list_get(e.reqid); > > > > if (co) { > > > > qemu_coroutine_enter(co); > > > > } else { > > > > bool done = false; > > > > req_replayed_list_insert(reqid, &done); > > > point B > > > > /* wait synchronously for completion */ > > > > while (!done) { > > > > aio_poll(); > > > > } > > > > } > > > > } > > > > } > > > > > > One more question about coroutines. > > > Are race conditions possible in this sample? > > > In replay mode we may call readv, and reach point A. > > > On the same time, we will read point B in another thread. > > > Then readv will yield and nobody will start it back? > > > > There are two aspects to this: > > > > * Real multithreading doesn't exist in the block layer. All block driver > > functions are only called with the mutex in the AioContext held. There > > is exactly one AioContext per BDS, so no two threads can possible be > > operating on the same BDS at the same time. > > > > * Coroutines are different from threads in that they aren't preemptive. > > They are only interrupted in places where they explicitly yield. > > > > Of course, in order for this to work, we actually need to take the mutex > > before calling blkreplay_run_event(), which is called directly from the > > replay code (which runs in the mainloop thread? Or vcpu?). > > blkreplay_run_event() is called from replay code which is protected by mutex. > This function may be called from io and vcpu threads, because both of them > have replay functions invocations. Now I've encountered a situation where blkreplay_run_event is called from read coroutine: bdrv_prwv_co -> aio_poll -> qemu_clock_get_ns -> replay_read_clock -> blkreplay_run_event \--> bdrv_co_readv -> blkreplay_co_readv -> bdrv_co_readv(lower layer) bdrv_co_readv inside blkreplay_co_readv can't proceed in this situation. This is probably because aio_poll has taken the aio context? How can I resolve this? Pavel Dovgalyuk
Am 15.02.2016 um 10:14 hat Pavel Dovgalyuk geschrieben: > > From: Pavel Dovgalyuk [mailto:dovgaluk@ispras.ru] > > > From: Kevin Wolf [mailto:kwolf@redhat.com] > > > > > > > > > > int blkreplay_co_readv() > > > > > { > > > > > BlockReplayState *s = bs->opaque; > > > > > int reqid = s->reqid++; > > > > > > > > > > bdrv_co_readv(bs->file, ...); > > > > > > > > > > if (mode == record) { > > > > > log(reqid, time); > > > > > } else { > > > > > assert(mode == replay); > > > > > bool *done = req_replayed_list_get(reqid) > > > > > if (done) { > > > > > *done = true; > > > > > } else { > > > > point A > > > > > req_completed_list_insert(reqid, qemu_coroutine_self()); > > > > > qemu_coroutine_yield(); > > > > > } > > > > > } > > > > > } > > > > > > > > > > /* called by replay.c */ > > > > > int blkreplay_run_event() > > > > > { > > > > > if (mode == replay) { > > > > > co = req_completed_list_get(e.reqid); > > > > > if (co) { > > > > > qemu_coroutine_enter(co); > > > > > } else { > > > > > bool done = false; > > > > > req_replayed_list_insert(reqid, &done); > > > > point B > > > > > /* wait synchronously for completion */ > > > > > while (!done) { > > > > > aio_poll(); > > > > > } > > > > > } > > > > > } > > > > > } > > > > > > > > One more question about coroutines. > > > > Are race conditions possible in this sample? > > > > In replay mode we may call readv, and reach point A. > > > > On the same time, we will read point B in another thread. > > > > Then readv will yield and nobody will start it back? > > > > > > There are two aspects to this: > > > > > > * Real multithreading doesn't exist in the block layer. All block driver > > > functions are only called with the mutex in the AioContext held. There > > > is exactly one AioContext per BDS, so no two threads can possible be > > > operating on the same BDS at the same time. > > > > > > * Coroutines are different from threads in that they aren't preemptive. > > > They are only interrupted in places where they explicitly yield. > > > > > > Of course, in order for this to work, we actually need to take the mutex > > > before calling blkreplay_run_event(), which is called directly from the > > > replay code (which runs in the mainloop thread? Or vcpu?). > > > > blkreplay_run_event() is called from replay code which is protected by mutex. > > This function may be called from io and vcpu threads, because both of them > > have replay functions invocations. > > Now I've encountered a situation where blkreplay_run_event is called from read coroutine: > bdrv_prwv_co -> aio_poll -> qemu_clock_get_ns -> replay_read_clock -> blkreplay_run_event > \--> bdrv_co_readv -> blkreplay_co_readv -> bdrv_co_readv(lower layer) > > bdrv_co_readv inside blkreplay_co_readv can't proceed in this situation. > This is probably because aio_poll has taken the aio context? > How can I resolve this? First of all, I'm not sure if running replay events from qemu_clock_get_ns() is such a great idea. This is not a function that callers expect to change any state. If you absolutely have to do it there instead of in the clock device emulations, maybe restricting it to replaying clock events could make it a bit more harmless. Anyway, what does "can't proceed" mean? The coroutine yields because it's waiting for I/O, but it is never reentered? Or is it hanging while trying to acquire a lock? Calling the callbacks that reenter a yielded coroutine is generally the job of aio_poll(). 
After reentering the coroutine, blkreplay_run_event() should return to its caller and therefore indirectly to aio_poll(), which should drive the events. Sounds like it should be working.

Can you provide more detail about the exact place where it's hanging, both in the coroutine and in the main "coroutine" that executes aio_poll()?

Kevin
> From: Kevin Wolf [mailto:kwolf@redhat.com] > Am 15.02.2016 um 10:14 hat Pavel Dovgalyuk geschrieben: > > > From: Pavel Dovgalyuk [mailto:dovgaluk@ispras.ru] > > > > From: Kevin Wolf [mailto:kwolf@redhat.com] > > > > > > > > > > > > int blkreplay_co_readv() > > > > > > { > > > > > > BlockReplayState *s = bs->opaque; > > > > > > int reqid = s->reqid++; > > > > > > > > > > > > bdrv_co_readv(bs->file, ...); > > > > > > > > > > > > if (mode == record) { > > > > > > log(reqid, time); > > > > > > } else { > > > > > > assert(mode == replay); > > > > > > bool *done = req_replayed_list_get(reqid) > > > > > > if (done) { > > > > > > *done = true; > > > > > > } else { > > > > > point A > > > > > > req_completed_list_insert(reqid, qemu_coroutine_self()); > > > > > > qemu_coroutine_yield(); > > > > > > } > > > > > > } > > > > > > } > > > > > > > > > > > > /* called by replay.c */ > > > > > > int blkreplay_run_event() > > > > > > { > > > > > > if (mode == replay) { > > > > > > co = req_completed_list_get(e.reqid); > > > > > > if (co) { > > > > > > qemu_coroutine_enter(co); > > > > > > } else { > > > > > > bool done = false; > > > > > > req_replayed_list_insert(reqid, &done); > > > > > point B > > > > > > /* wait synchronously for completion */ > > > > > > while (!done) { > > > > > > aio_poll(); > > > > > > } > > > > > > } > > > > > > } > > > > > > } > > > > > > Now I've encountered a situation where blkreplay_run_event is called from read coroutine: > > bdrv_prwv_co -> aio_poll -> qemu_clock_get_ns -> replay_read_clock -> blkreplay_run_event > > \--> bdrv_co_readv -> blkreplay_co_readv -> bdrv_co_readv(lower layer) > > > > bdrv_co_readv inside blkreplay_co_readv can't proceed in this situation. > > This is probably because aio_poll has taken the aio context? > > How can I resolve this? > > First of all, I'm not sure if running replay events from > qemu_clock_get_ns() is such a great idea. This is not a function that > callers expect to change any state. If you absolutely have to do it > there instead of in the clock device emulations, maybe restricting it to > replaying clock events could make it a bit more harmless. qemu_clock_get_ns() wants to read some clock data from the log. While reading, it finds block driver event and tries to proceed it. These block events may occur at any moment, because blkreplay_co_readv() writes them immediately as executes. Alternative approach is adding these events to the queue and executing them when checkpoint will be met. I'm not sure that this is easy to implement with coroutines. > Anyway, what does "can't proceed" mean? The coroutine yields because > it's waiting for I/O, but it is never reentered? Or is it hanging while > trying to acquire a lock? bdrv_co_io_em() (raw file read/write) yields and waits inifinitely to return, because aio_poll hands in some clock read. > Calling the callbacks that reenter a yielded coroutine is generally the > job of aio_poll(). After reentering the coroutine, blkreplay_run_event() > should return back to its caller and therefore indirectly to aio_poll(), > which should drive the events. Sounds like it should be working. Yield occurred inside blkreplay_co_readv()->bdrv_co_readv() before the request was added to the queue. Pavel Dovgalyuk
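The queueing alternative mentioned here could be sketched roughly as follows; all names are hypothetical and this only illustrates deferring the events from clock reads to checkpoints:

    /* Hypothetical deferral of block events: queue them when they are read
     * from the log and only run them at the next checkpoint, so that
     * qemu_clock_get_ns() never enters block layer code itself. */
    typedef struct BlockReplayEvent {
        uint64_t reqid;
        QSIMPLEQ_ENTRY(BlockReplayEvent) entry;
    } BlockReplayEvent;

    static QSIMPLEQ_HEAD(, BlockReplayEvent) pending_block_events =
        QSIMPLEQ_HEAD_INITIALIZER(pending_block_events);

    static void replay_defer_block_event(uint64_t reqid)
    {
        BlockReplayEvent *e = g_new0(BlockReplayEvent, 1);

        e->reqid = reqid;
        QSIMPLEQ_INSERT_TAIL(&pending_block_events, e, entry);
    }

    /* Called from the checkpoint code instead of from clock reads. */
    static void replay_flush_block_events(void)
    {
        BlockReplayEvent *e;

        while ((e = QSIMPLEQ_FIRST(&pending_block_events)) != NULL) {
            QSIMPLEQ_REMOVE_HEAD(&pending_block_events, entry);
            blkreplay_run_event(e->reqid);      /* hypothetical signature */
            g_free(e);
        }
    }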
Am 15.02.2016 um 12:19 hat Pavel Dovgalyuk geschrieben: > > From: Kevin Wolf [mailto:kwolf@redhat.com] > > Am 15.02.2016 um 10:14 hat Pavel Dovgalyuk geschrieben: > > > > From: Pavel Dovgalyuk [mailto:dovgaluk@ispras.ru] > > > > > From: Kevin Wolf [mailto:kwolf@redhat.com] > > > > > > > > > > > > > > int blkreplay_co_readv() > > > > > > > { > > > > > > > BlockReplayState *s = bs->opaque; > > > > > > > int reqid = s->reqid++; > > > > > > > > > > > > > > bdrv_co_readv(bs->file, ...); > > > > > > > > > > > > > > if (mode == record) { > > > > > > > log(reqid, time); > > > > > > > } else { > > > > > > > assert(mode == replay); > > > > > > > bool *done = req_replayed_list_get(reqid) > > > > > > > if (done) { > > > > > > > *done = true; > > > > > > > } else { > > > > > > point A > > > > > > > req_completed_list_insert(reqid, qemu_coroutine_self()); > > > > > > > qemu_coroutine_yield(); > > > > > > > } > > > > > > > } > > > > > > > } > > > > > > > > > > > > > > /* called by replay.c */ > > > > > > > int blkreplay_run_event() > > > > > > > { > > > > > > > if (mode == replay) { > > > > > > > co = req_completed_list_get(e.reqid); > > > > > > > if (co) { > > > > > > > qemu_coroutine_enter(co); > > > > > > > } else { > > > > > > > bool done = false; > > > > > > > req_replayed_list_insert(reqid, &done); > > > > > > point B > > > > > > > /* wait synchronously for completion */ > > > > > > > while (!done) { > > > > > > > aio_poll(); > > > > > > > } > > > > > > > } > > > > > > > } > > > > > > > } > > > > > > > > Now I've encountered a situation where blkreplay_run_event is called from read coroutine: > > > bdrv_prwv_co -> aio_poll -> qemu_clock_get_ns -> replay_read_clock -> blkreplay_run_event > > > \--> bdrv_co_readv -> blkreplay_co_readv -> bdrv_co_readv(lower layer) > > > > > > bdrv_co_readv inside blkreplay_co_readv can't proceed in this situation. > > > This is probably because aio_poll has taken the aio context? > > > How can I resolve this? > > > > First of all, I'm not sure if running replay events from > > qemu_clock_get_ns() is such a great idea. This is not a function that > > callers expect to change any state. If you absolutely have to do it > > there instead of in the clock device emulations, maybe restricting it to > > replaying clock events could make it a bit more harmless. > > qemu_clock_get_ns() wants to read some clock data from the log. > While reading, it finds block driver event and tries to proceed it. > These block events may occur at any moment, because blkreplay_co_readv() > writes them immediately as executes. I understand what's happening. I just think it's asking for bugs when you do this in a low-level function like qemu_clock_get_ns() that nobody expects to have any side effects. But we'll see how many bugs this will cause. > Alternative approach is adding these events to the queue and executing them > when checkpoint will be met. I'm not sure that this is easy to implement with > coroutines. Shouldn't be hard, you're queueing requests already. But it's unlikely to be the right solution. Why does qemu_clock_get_ns() need to replay events to begin with? Shouldn't it just return the current state without replaying anything new? > > Anyway, what does "can't proceed" mean? The coroutine yields because > > it's waiting for I/O, but it is never reentered? Or is it hanging while > > trying to acquire a lock? > > bdrv_co_io_em() (raw file read/write) yields and waits inifinitely to return, because > aio_poll hands in some clock read. 
Okay, so the coroutine is sleeping while it waits for I/O to complete. What's
missing is the I/O completion callback.

When the coroutine yields in bdrv_co_io_em(), the calling coroutine continues
execution at blkreplay_run_event(). My naive expectation would be that it
returns to replay_read_clock(), which in turn returns to qemu_clock_get_ns(),
which knows the right time now and returns to aio_poll(), which would process
I/O completions and call the callback. The callback would reenter the other
coroutine in bdrv_co_io_em() and the request would be completed.

As the callback isn't happening, something in this unwinding process doesn't
work as expected and we're actually hanging somewhere else. Do you know where
that is?

Kevin
> From: Kevin Wolf [mailto:kwolf@redhat.com] > Am 15.02.2016 um 10:14 hat Pavel Dovgalyuk geschrieben: > > > From: Pavel Dovgalyuk [mailto:dovgaluk@ispras.ru] > > > > From: Kevin Wolf [mailto:kwolf@redhat.com] > > > > > > > > > > > > int blkreplay_co_readv() > > > > > > { > > > > > > BlockReplayState *s = bs->opaque; > > > > > > int reqid = s->reqid++; > > > > > > > > > > > > bdrv_co_readv(bs->file, ...); > > > > > > > > > > > > if (mode == record) { > > > > > > log(reqid, time); > > > > > > } else { > > > > > > assert(mode == replay); > > > > > > bool *done = req_replayed_list_get(reqid) > > > > > > if (done) { > > > > > > *done = true; > > > > > > } else { > > > > > point A > > > > > > req_completed_list_insert(reqid, qemu_coroutine_self()); > > > > > > qemu_coroutine_yield(); > > > > > > } > > > > > > } > > > > > > } > > > > > > > > > > > > /* called by replay.c */ > > > > > > int blkreplay_run_event() > > > > > > { > > > > > > if (mode == replay) { > > > > > > co = req_completed_list_get(e.reqid); > > > > > > if (co) { > > > > > > qemu_coroutine_enter(co); > > > > > > } else { > > > > > > bool done = false; > > > > > > req_replayed_list_insert(reqid, &done); > > > > > point B > > > > > > /* wait synchronously for completion */ > > > > > > while (!done) { > > > > > > aio_poll(); > > > > > > } > > > > > > } > > > > > > } > > > > > > } > > > > > > > > > > One more question about coroutines. > > > > > Are race conditions possible in this sample? > > > > > In replay mode we may call readv, and reach point A. > > > > > On the same time, we will read point B in another thread. > > > > > Then readv will yield and nobody will start it back? > > > > > > > > There are two aspects to this: > > > > > > > > * Real multithreading doesn't exist in the block layer. All block driver > > > > functions are only called with the mutex in the AioContext held. There > > > > is exactly one AioContext per BDS, so no two threads can possible be > > > > operating on the same BDS at the same time. > > > > > > > > * Coroutines are different from threads in that they aren't preemptive. > > > > They are only interrupted in places where they explicitly yield. > > > > > > > > Of course, in order for this to work, we actually need to take the mutex > > > > before calling blkreplay_run_event(), which is called directly from the > > > > replay code (which runs in the mainloop thread? Or vcpu?). > > > > > > blkreplay_run_event() is called from replay code which is protected by mutex. > > > This function may be called from io and vcpu threads, because both of them > > > have replay functions invocations. > > > > Now I've encountered a situation where blkreplay_run_event is called from read coroutine: > > bdrv_prwv_co -> aio_poll -> qemu_clock_get_ns -> replay_read_clock -> blkreplay_run_event > > \--> bdrv_co_readv -> blkreplay_co_readv -> bdrv_co_readv(lower layer) > > > > bdrv_co_readv inside blkreplay_co_readv can't proceed in this situation. > > This is probably because aio_poll has taken the aio context? > > How can I resolve this? > > First of all, I'm not sure if running replay events from > qemu_clock_get_ns() is such a great idea. This is not a function that > callers expect to change any state. If you absolutely have to do it > there instead of in the clock device emulations, maybe restricting it to > replaying clock events could make it a bit more harmless. Only virtual clock is emulated, and host clock is read from the host real time sources and therefore has to be saved into the log. 
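For context, this is roughly how the host clock ends up in the log: under record the value obtained from the host is saved, under replay the saved value is returned instead. A sketch from memory of the record/replay core, not taken from the posted series; the wrapper name host_clock_get_ns is only illustrative:

    /* Sketch: how the host clock could be routed through the replay log.
     * replay_save_clock()/replay_read_clock() are the replay-time helpers;
     * the wrapper function name is illustrative. */
    static int64_t host_clock_get_ns(void)
    {
        if (replay_mode == REPLAY_MODE_RECORD) {
            /* record: read the real host clock and save the value to the log */
            return replay_save_clock(REPLAY_CLOCK_HOST, get_clock_realtime());
        } else if (replay_mode == REPLAY_MODE_PLAY) {
            /* replay: the guest RTC sees the value read back from the log */
            return replay_read_clock(REPLAY_CLOCK_HOST);
        }
        return get_clock_realtime();
    }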
There could be asynchronous events that occur in non-cpu threads.
For now these events are the shutdown request and block task execution.
They may "hide" subsequent clock (or other) events. That is why we process
them until a synchronous event (like a clock read, instruction execution,
or a checkpoint) is met.

> Anyway, what does "can't proceed" mean? The coroutine yields because
> it's waiting for I/O, but it is never reentered? Or is it hanging while
> trying to acquire a lock?

I've solved this problem by slightly modifying the queue.
I haven't yet implemented the assignment of a BlockDriverState to the request ids.
Therefore aio_poll was temporarily replaced with usleep.
Now execution starts and hangs at some random moment of OS loading.

Here is the current version of the blkreplay functions:

static int coroutine_fn blkreplay_co_readv(BlockDriverState *bs,
    int64_t sector_num, int nb_sectors, QEMUIOVector *qiov)
{
    uint32_t reqid = request_id++;
    Request *req;
    req = block_request_insert(reqid, bs, qemu_coroutine_self());
    bdrv_co_readv(bs->file->bs, sector_num, nb_sectors, qiov);

    if (replay_mode == REPLAY_MODE_RECORD) {
        replay_save_block_event(reqid);
    } else {
        assert(replay_mode == REPLAY_MODE_PLAY);
        qemu_coroutine_yield();
    }
    block_request_remove(req);

    return 0;
}

void replay_run_block_event(uint32_t id)
{
    Request *req;
    if (replay_mode == REPLAY_MODE_PLAY) {
        while (!(req = block_request_find(id))) {
            //aio_poll(bdrv_get_aio_context(req->bs), true);
            usleep(1);
        }
        qemu_coroutine_enter(req->co, NULL);
    }
}

> Can you provide more detail about the exact place where it's hanging,
> both in the coroutine and in the main "coroutine" that executes
> aio_poll()?

In this version replay_run_block_event() executes the while loop.
I haven't found what the other threads do, because the debugger doesn't show
me the call stack when a thread is waiting in some blocking function.

Pavel Dovgalyuk
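The block_request_insert/find/remove helpers used above are not included in the mail; a minimal sketch of what they could look like, assuming a simple global QLIST keyed by request id (all of these names are only illustrative):

    typedef struct Request {
        uint32_t id;
        BlockDriverState *bs;
        Coroutine *co;
        QLIST_ENTRY(Request) list;
    } Request;

    static QLIST_HEAD(, Request) block_requests =
        QLIST_HEAD_INITIALIZER(block_requests);

    static Request *block_request_insert(uint32_t id, BlockDriverState *bs,
                                         Coroutine *co)
    {
        Request *req = g_new0(Request, 1);

        req->id = id;
        req->bs = bs;
        req->co = co;
        QLIST_INSERT_HEAD(&block_requests, req, list);
        return req;
    }

    static Request *block_request_find(uint32_t id)
    {
        Request *req;

        QLIST_FOREACH(req, &block_requests, list) {
            if (req->id == id) {
                return req;
            }
        }
        return NULL;
    }

    static void block_request_remove(Request *req)
    {
        QLIST_REMOVE(req, list);
        g_free(req);
    }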
Am 15.02.2016 um 14:54 hat Pavel Dovgalyuk geschrieben: > > From: Kevin Wolf [mailto:kwolf@redhat.com] > > Am 15.02.2016 um 10:14 hat Pavel Dovgalyuk geschrieben: > > > > From: Pavel Dovgalyuk [mailto:dovgaluk@ispras.ru] > > > > > From: Kevin Wolf [mailto:kwolf@redhat.com] > > > > > > > > > > > > > > int blkreplay_co_readv() > > > > > > > { > > > > > > > BlockReplayState *s = bs->opaque; > > > > > > > int reqid = s->reqid++; > > > > > > > > > > > > > > bdrv_co_readv(bs->file, ...); > > > > > > > > > > > > > > if (mode == record) { > > > > > > > log(reqid, time); > > > > > > > } else { > > > > > > > assert(mode == replay); > > > > > > > bool *done = req_replayed_list_get(reqid) > > > > > > > if (done) { > > > > > > > *done = true; > > > > > > > } else { > > > > > > point A > > > > > > > req_completed_list_insert(reqid, qemu_coroutine_self()); > > > > > > > qemu_coroutine_yield(); > > > > > > > } > > > > > > > } > > > > > > > } > > > > > > > > > > > > > > /* called by replay.c */ > > > > > > > int blkreplay_run_event() > > > > > > > { > > > > > > > if (mode == replay) { > > > > > > > co = req_completed_list_get(e.reqid); > > > > > > > if (co) { > > > > > > > qemu_coroutine_enter(co); > > > > > > > } else { > > > > > > > bool done = false; > > > > > > > req_replayed_list_insert(reqid, &done); > > > > > > point B > > > > > > > /* wait synchronously for completion */ > > > > > > > while (!done) { > > > > > > > aio_poll(); > > > > > > > } > > > > > > > } > > > > > > > } > > > > > > > } > > > > > > > > > > > > One more question about coroutines. > > > > > > Are race conditions possible in this sample? > > > > > > In replay mode we may call readv, and reach point A. > > > > > > On the same time, we will read point B in another thread. > > > > > > Then readv will yield and nobody will start it back? > > > > > > > > > > There are two aspects to this: > > > > > > > > > > * Real multithreading doesn't exist in the block layer. All block driver > > > > > functions are only called with the mutex in the AioContext held. There > > > > > is exactly one AioContext per BDS, so no two threads can possible be > > > > > operating on the same BDS at the same time. > > > > > > > > > > * Coroutines are different from threads in that they aren't preemptive. > > > > > They are only interrupted in places where they explicitly yield. > > > > > > > > > > Of course, in order for this to work, we actually need to take the mutex > > > > > before calling blkreplay_run_event(), which is called directly from the > > > > > replay code (which runs in the mainloop thread? Or vcpu?). > > > > > > > > blkreplay_run_event() is called from replay code which is protected by mutex. > > > > This function may be called from io and vcpu threads, because both of them > > > > have replay functions invocations. > > > > > > Now I've encountered a situation where blkreplay_run_event is called from read coroutine: > > > bdrv_prwv_co -> aio_poll -> qemu_clock_get_ns -> replay_read_clock -> blkreplay_run_event > > > \--> bdrv_co_readv -> blkreplay_co_readv -> bdrv_co_readv(lower layer) > > > > > > bdrv_co_readv inside blkreplay_co_readv can't proceed in this situation. > > > This is probably because aio_poll has taken the aio context? > > > How can I resolve this? > > > > First of all, I'm not sure if running replay events from > > qemu_clock_get_ns() is such a great idea. This is not a function that > > callers expect to change any state. 
If you absolutely have to do it > > there instead of in the clock device emulations, maybe restricting it to > > replaying clock events could make it a bit more harmless. > > Only virtual clock is emulated, and host clock is read from the host > real time sources and therefore has to be saved into the log. Isn't the host clock invisible to the guest anyway? > There could be asynchronous events that occur in non-cpu threads. > For now these events are shutdown request and block task execution. > They may "hide" following clock (or another one) events. That is why > we process them until synchronous event (like clock, instructions > execution, or checkpoint) is met. > > > > Anyway, what does "can't proceed" mean? The coroutine yields because > > it's waiting for I/O, but it is never reentered? Or is it hanging while > > trying to acquire a lock? > > I've solved this problem by slightly modifying the queue. > I haven't yet made BlockDriverState assignment to the request ids. > Therefore aio_poll was temporarily replaced with usleep. > Now execution starts and hangs at some random moment of OS loading. > > Here is the current version of blkreplay functions: > > static int coroutine_fn blkreplay_co_readv(BlockDriverState *bs, > int64_t sector_num, int nb_sectors, QEMUIOVector *qiov) > { > uint32_t reqid = request_id++; > Request *req; > req = block_request_insert(reqid, bs, qemu_coroutine_self()); > bdrv_co_readv(bs->file->bs, sector_num, nb_sectors, qiov); > > if (replay_mode == REPLAY_MODE_RECORD) { > replay_save_block_event(reqid); > } else { > assert(replay_mode == REPLAY_MODE_PLAY); > qemu_coroutine_yield(); > } > block_request_remove(req); > > return 0; > } > > void replay_run_block_event(uint32_t id) > { > Request *req; > if (replay_mode == REPLAY_MODE_PLAY) { > while (!(req = block_request_find(id))) { > //aio_poll(bdrv_get_aio_context(req->bs), true); > usleep(1); > } How is this loop supposed to make any progress? And I still don't understand why aio_poll() doesn't work and where it hangs. Kevin > qemu_coroutine_enter(req->co, NULL); > } > } > > > Can you provide more detail about the exact place where it's hanging, > > both in the coroutine and in the main "coroutine" that executes > > aio_poll()? > > In this version replay_run_block_event() executes while loop. > I haven't found what other threads do, because the debugger doesn't show me > call stack when thread is waiting in some blocking function. > > Pavel Dovgalyuk >
> From: Kevin Wolf [mailto:kwolf@redhat.com]
> > >
> > > First of all, I'm not sure if running replay events from
> > > qemu_clock_get_ns() is such a great idea. This is not a function that
> > > callers expect to change any state. If you absolutely have to do it
> > > there instead of in the clock device emulations, maybe restricting it to
> > > replaying clock events could make it a bit more harmless.
> >
> > Only virtual clock is emulated, and host clock is read from the host
> > real time sources and therefore has to be saved into the log.
>
> Isn't the host clock invisible to the guest anyway?

It isn't. Host clock is used by the guest RTC.

> > There could be asynchronous events that occur in non-cpu threads.
> > For now these events are shutdown request and block task execution.
> > They may "hide" following clock (or another one) events. That is why
> > we process them until synchronous event (like clock, instructions
> > execution, or checkpoint) is met.
> >
> >
> > > Anyway, what does "can't proceed" mean? The coroutine yields because
> > > it's waiting for I/O, but it is never reentered? Or is it hanging while
> > > trying to acquire a lock?
> >
> > I've solved this problem by slightly modifying the queue.
> > I haven't yet made BlockDriverState assignment to the request ids.
> > Therefore aio_poll was temporarily replaced with usleep.
> > Now execution starts and hangs at some random moment of OS loading.
> >
> > Here is the current version of blkreplay functions:
> >
> > static int coroutine_fn blkreplay_co_readv(BlockDriverState *bs,
> >     int64_t sector_num, int nb_sectors, QEMUIOVector *qiov)
> > {
> >     uint32_t reqid = request_id++;
> >     Request *req;
> >     req = block_request_insert(reqid, bs, qemu_coroutine_self());
> >     bdrv_co_readv(bs->file->bs, sector_num, nb_sectors, qiov);
> >
> >     if (replay_mode == REPLAY_MODE_RECORD) {
> >         replay_save_block_event(reqid);
> >     } else {
> >         assert(replay_mode == REPLAY_MODE_PLAY);
> >         qemu_coroutine_yield();
> >     }
> >     block_request_remove(req);
> >
> >     return 0;
> > }
> >
> > void replay_run_block_event(uint32_t id)
> > {
> >     Request *req;
> >     if (replay_mode == REPLAY_MODE_PLAY) {
> >         while (!(req = block_request_find(id))) {
> >             //aio_poll(bdrv_get_aio_context(req->bs), true);
> >             usleep(1);
> >         }
>
> How is this loop supposed to make any progress?

This loop is not supposed to make any progress. It waits until the request
is added to the queue by a block_request_insert call.

> And I still don't understand why aio_poll() doesn't work and where it hangs.

aio_poll hangs if the "req = block_request_insert(reqid, bs, qemu_coroutine_self());"
line is executed after bdrv_co_readv. When bdrv_co_readv yields,
replay_run_block_event has no information about the pending request and cannot
jump to its coroutine.
Maybe I should implement aio_poll execution there to make progress in that case?

> >         qemu_coroutine_enter(req->co, NULL);
> >     }
> > }

Pavel Dovgalyuk
> From: Kevin Wolf [mailto:kwolf@redhat.com]
> > void replay_run_block_event(uint32_t id)
> > {
> >     Request *req;
> >     if (replay_mode == REPLAY_MODE_PLAY) {
> >         while (!(req = block_request_find(id))) {
> >             //aio_poll(bdrv_get_aio_context(req->bs), true);
> >             usleep(1);
> >         }
>
> >         qemu_coroutine_enter(req->co, NULL);
> >     }
> > }
> >
> > > Can you provide more detail about the exact place where it's hanging,
> > > both in the coroutine and in the main "coroutine" that executes
> > > aio_poll()?

I've tried to replace usleep with aio_poll. In this case replay cannot
proceed, because of a recursive mutex lock:

  aio_poll -> qemu_clock_get_ns -> <lock replay mutex> ->
  replay_run_block_event -> aio_poll -> qemu_clock_get_ns -> <lock replay mutex> ->
  <assert failed>

Pavel Dovgalyuk
Am 15.02.2016 um 15:24 hat Pavel Dovgalyuk geschrieben: > > From: Kevin Wolf [mailto:kwolf@redhat.com] > > > > > > > > First of all, I'm not sure if running replay events from > > > > qemu_clock_get_ns() is such a great idea. This is not a function that > > > > callers expect to change any state. If you absolutely have to do it > > > > there instead of in the clock device emulations, maybe restricting it to > > > > replaying clock events could make it a bit more harmless. > > > > > > Only virtual clock is emulated, and host clock is read from the host > > > real time sources and therefore has to be saved into the log. > > > > Isn't the host clock invisible to the guest anyway? > > It isn't. Host clock is used by guest RTC. > > > > There could be asynchronous events that occur in non-cpu threads. > > > For now these events are shutdown request and block task execution. > > > They may "hide" following clock (or another one) events. That is why > > > we process them until synchronous event (like clock, instructions > > > execution, or checkpoint) is met. > > > > > > > > > > Anyway, what does "can't proceed" mean? The coroutine yields because > > > > it's waiting for I/O, but it is never reentered? Or is it hanging while > > > > trying to acquire a lock? > > > > > > I've solved this problem by slightly modifying the queue. > > > I haven't yet made BlockDriverState assignment to the request ids. > > > Therefore aio_poll was temporarily replaced with usleep. > > > Now execution starts and hangs at some random moment of OS loading. > > > > > > Here is the current version of blkreplay functions: > > > > > > static int coroutine_fn blkreplay_co_readv(BlockDriverState *bs, > > > int64_t sector_num, int nb_sectors, QEMUIOVector *qiov) > > > { > > > uint32_t reqid = request_id++; > > > Request *req; > > > req = block_request_insert(reqid, bs, qemu_coroutine_self()); > > > bdrv_co_readv(bs->file->bs, sector_num, nb_sectors, qiov); > > > > > > if (replay_mode == REPLAY_MODE_RECORD) { > > > replay_save_block_event(reqid); > > > } else { > > > assert(replay_mode == REPLAY_MODE_PLAY); > > > qemu_coroutine_yield(); > > > } > > > block_request_remove(req); > > > > > > return 0; > > > } > > > > > > void replay_run_block_event(uint32_t id) > > > { > > > Request *req; > > > if (replay_mode == REPLAY_MODE_PLAY) { > > > while (!(req = block_request_find(id))) { > > > //aio_poll(bdrv_get_aio_context(req->bs), true); > > > usleep(1); > > > } > > > > How is this loop supposed to make any progress? > > This loop does not supposed to make any progress. It waits until block_request_insert > call is added to the queue. Yes. And who is supposed to add something to the queue? You are looping and doing nothing, so no other code gets to run in this thread. It's only aio_poll() that would run event handlers that could eventually add something to the list. > > And I still don't understand why aio_poll() doesn't work and where it > > hangs. > > aio_poll hangs if "req = block_request_insert(reqid, bs, qemu_coroutine_self());" line > is executed after bdrv_co_readv. When bdrv_co_readv yields, replay_run_block_event has no > information about pending request and cannot jump to its coroutine. > Maybe I should implement aio_poll execution there to make progress in that case? Up in the thread, I posted code that was a bit more complex than what you have, and which considered both cases (completion before replay/completion after replay). If you implement it like this, it might just work. 
And yes, it involved an aio_poll() loop in the replay function. Kevin
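Spelled out as C, that two-sided version might look roughly like this. The req_replayed_list_*/req_completed_list_* helpers are assumed, not existing QEMU APIs, and the main-loop AioContext stands in for the device's context in the non-dataplane case:

    static int coroutine_fn blkreplay_co_readv(BlockDriverState *bs,
        int64_t sector_num, int nb_sectors, QEMUIOVector *qiov)
    {
        uint32_t reqid = request_id++;

        bdrv_co_readv(bs->file->bs, sector_num, nb_sectors, qiov);

        if (replay_mode == REPLAY_MODE_RECORD) {
            replay_save_block_event(reqid);
        } else {
            assert(replay_mode == REPLAY_MODE_PLAY);
            bool *done = req_replayed_list_get(reqid);
            if (done) {
                /* replay_run_block_event() got here first and is polling */
                *done = true;
            } else {
                /* the log event has not been replayed yet; park until it is */
                req_completed_list_insert(reqid, qemu_coroutine_self());
                qemu_coroutine_yield();
            }
        }

        return 0;
    }

    /* called by replay.c when the block event is read from the log */
    void replay_run_block_event(uint32_t reqid)
    {
        if (replay_mode != REPLAY_MODE_PLAY) {
            return;
        }

        Coroutine *co = req_completed_list_get(reqid);
        if (co) {
            /* the request already completed; let the parked coroutine finish */
            qemu_coroutine_enter(co, NULL);
        } else {
            /* the request is still in flight; wait synchronously for it */
            bool done = false;
            req_replayed_list_insert(reqid, &done);
            while (!done) {
                aio_poll(qemu_get_aio_context(), true);
            }
        }
    }

Whether that aio_poll() loop can be entered safely from under qemu_clock_get_ns() is exactly the recursion question that comes up next in the thread.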
> From: Kevin Wolf [mailto:kwolf@redhat.com] > Am 15.02.2016 um 15:24 hat Pavel Dovgalyuk geschrieben: > > > From: Kevin Wolf [mailto:kwolf@redhat.com] > > > > > > There could be asynchronous events that occur in non-cpu threads. > > > > For now these events are shutdown request and block task execution. > > > > They may "hide" following clock (or another one) events. That is why > > > > we process them until synchronous event (like clock, instructions > > > > execution, or checkpoint) is met. > > > > > > > > > > > > > Anyway, what does "can't proceed" mean? The coroutine yields because > > > > > it's waiting for I/O, but it is never reentered? Or is it hanging while > > > > > trying to acquire a lock? > > > > > > > > I've solved this problem by slightly modifying the queue. > > > > I haven't yet made BlockDriverState assignment to the request ids. > > > > Therefore aio_poll was temporarily replaced with usleep. > > > > Now execution starts and hangs at some random moment of OS loading. > > > > > > > > Here is the current version of blkreplay functions: > > > > > > > > static int coroutine_fn blkreplay_co_readv(BlockDriverState *bs, > > > > int64_t sector_num, int nb_sectors, QEMUIOVector *qiov) > > > > { > > > > uint32_t reqid = request_id++; > > > > Request *req; > > > > req = block_request_insert(reqid, bs, qemu_coroutine_self()); > > > > bdrv_co_readv(bs->file->bs, sector_num, nb_sectors, qiov); > > > > > > > > if (replay_mode == REPLAY_MODE_RECORD) { > > > > replay_save_block_event(reqid); > > > > } else { > > > > assert(replay_mode == REPLAY_MODE_PLAY); > > > > qemu_coroutine_yield(); > > > > } > > > > block_request_remove(req); > > > > > > > > return 0; > > > > } > > > > > > > > void replay_run_block_event(uint32_t id) > > > > { > > > > Request *req; > > > > if (replay_mode == REPLAY_MODE_PLAY) { > > > > while (!(req = block_request_find(id))) { > > > > //aio_poll(bdrv_get_aio_context(req->bs), true); > > > > usleep(1); > > > > } > > > > > > How is this loop supposed to make any progress? > > > > This loop does not supposed to make any progress. It waits until block_request_insert > > call is added to the queue. > > Yes. And who is supposed to add something to the queue? You are looping > and doing nothing, so no other code gets to run in this thread. It's > only aio_poll() that would run event handlers that could eventually add > something to the list. blkreplay_co_readv adds request to the queue. > > > > And I still don't understand why aio_poll() doesn't work and where it > > > hangs. > > > > aio_poll hangs if "req = block_request_insert(reqid, bs, qemu_coroutine_self());" line > > is executed after bdrv_co_readv. When bdrv_co_readv yields, replay_run_block_event has no > > information about pending request and cannot jump to its coroutine. > > Maybe I should implement aio_poll execution there to make progress in that case? > > Up in the thread, I posted code that was a bit more complex than what > you have, and which considered both cases (completion before > replay/completion after replay). If you implement it like this, it might > just work. And yes, it involved an aio_poll() loop in the replay > function. You are right, but I found kind of fundamental problem here. There are two possible approaches for replaying coroutine events: 1. Waiting with aio_poll in replay_run_block_event. 
In this case replay cannot proceed, because of a recursive mutex lock:
  aio_poll -> qemu_clock_get_ns -> <lock replay mutex> ->
  replay_run_block_event -> aio_poll -> qemu_clock_get_ns -> <lock replay mutex> ->
  <assert failed>
I.e. we have recursive aio_poll function calls that lead to recursive replay calls
and to a recursive mutex lock.

2. Adding block events to the queue until a checkpoint is met.
This will not work with synchronous requests, because they will wait infinitely
and the checkpoint will never happen. Correct me if I'm wrong about this case.
If blkreplay_co_readv yields, can the vcpu thread unlock the BQL and wait for
the checkpoint in the iothread?

Pavel Dovgalyuk
Am 16.02.2016 um 07:25 hat Pavel Dovgalyuk geschrieben: > > From: Kevin Wolf [mailto:kwolf@redhat.com] > > Am 15.02.2016 um 15:24 hat Pavel Dovgalyuk geschrieben: > > > > From: Kevin Wolf [mailto:kwolf@redhat.com] > > > > > > > > There could be asynchronous events that occur in non-cpu threads. > > > > > For now these events are shutdown request and block task execution. > > > > > They may "hide" following clock (or another one) events. That is why > > > > > we process them until synchronous event (like clock, instructions > > > > > execution, or checkpoint) is met. > > > > > > > > > > > > > > > > Anyway, what does "can't proceed" mean? The coroutine yields because > > > > > > it's waiting for I/O, but it is never reentered? Or is it hanging while > > > > > > trying to acquire a lock? > > > > > > > > > > I've solved this problem by slightly modifying the queue. > > > > > I haven't yet made BlockDriverState assignment to the request ids. > > > > > Therefore aio_poll was temporarily replaced with usleep. > > > > > Now execution starts and hangs at some random moment of OS loading. > > > > > > > > > > Here is the current version of blkreplay functions: > > > > > > > > > > static int coroutine_fn blkreplay_co_readv(BlockDriverState *bs, > > > > > int64_t sector_num, int nb_sectors, QEMUIOVector *qiov) > > > > > { > > > > > uint32_t reqid = request_id++; > > > > > Request *req; > > > > > req = block_request_insert(reqid, bs, qemu_coroutine_self()); > > > > > bdrv_co_readv(bs->file->bs, sector_num, nb_sectors, qiov); > > > > > > > > > > if (replay_mode == REPLAY_MODE_RECORD) { > > > > > replay_save_block_event(reqid); > > > > > } else { > > > > > assert(replay_mode == REPLAY_MODE_PLAY); > > > > > qemu_coroutine_yield(); > > > > > } > > > > > block_request_remove(req); > > > > > > > > > > return 0; > > > > > } > > > > > > > > > > void replay_run_block_event(uint32_t id) > > > > > { > > > > > Request *req; > > > > > if (replay_mode == REPLAY_MODE_PLAY) { > > > > > while (!(req = block_request_find(id))) { > > > > > //aio_poll(bdrv_get_aio_context(req->bs), true); > > > > > usleep(1); > > > > > } > > > > > > > > How is this loop supposed to make any progress? > > > > > > This loop does not supposed to make any progress. It waits until block_request_insert > > > call is added to the queue. > > > > Yes. And who is supposed to add something to the queue? You are looping > > and doing nothing, so no other code gets to run in this thread. It's > > only aio_poll() that would run event handlers that could eventually add > > something to the list. > > blkreplay_co_readv adds request to the queue. I don't see blkreplay_co_readv() called in this loop without aio_poll(). > > > > And I still don't understand why aio_poll() doesn't work and where it > > > > hangs. > > > > > > aio_poll hangs if "req = block_request_insert(reqid, bs, qemu_coroutine_self());" line > > > is executed after bdrv_co_readv. When bdrv_co_readv yields, replay_run_block_event has no > > > information about pending request and cannot jump to its coroutine. > > > Maybe I should implement aio_poll execution there to make progress in that case? > > > > Up in the thread, I posted code that was a bit more complex than what > > you have, and which considered both cases (completion before > > replay/completion after replay). If you implement it like this, it might > > just work. And yes, it involved an aio_poll() loop in the replay > > function. > > You are right, but I found kind of fundamental problem here. 
> There are two possible approaches for replaying coroutine events:
>
> 1. Waiting with aio_poll in replay_run_block_event.
> In this case replay cannot be made, because of recursive mutex lock:
> aio_poll -> qemu_clock_get_ns -> <lock replay mutex> ->
> replay_run_block_event -> aio_poll -> qemu_clock_get_ns -> <lock replay mutex> ->
> <assert failed>
> I.e. we have recursive aio_poll function calls that lead to recursive replay calls
> and to recursive mutex lock.

That's what you get for replaying stuff in qemu_clock_get_ns() when it
should really be replayed in the hardware emulation.

This is a problem you would also have had with your original patch, as
there are some places in the block layer that use aio_poll() internally.
Using the blkreplay driver only made you find the bug earlier.

But... can you detail how we get from aio_poll() to qemu_clock_get_ns()?
I don't see it directly calling that function, so it might just be a
callback to some guest device that is running now.

> 2. Adding block events to the queue until checkpoint is met.
> This will not work with synchronous requests, because they will wait infinitely
> and checkpoint will never happen. Correct me, if I'm not right for this case.
> If blkreplay_co_readv will yield, can vcpu thread unlock the BQL and wait for
> the checkpoint in iothread?

I don't know enough about checkpoints and CPU emulation to give a definite
answer. Blocking the I/O thread like this feels wrong, though.

Kevin
> From: Kevin Wolf [mailto:kwolf@redhat.com] > Am 16.02.2016 um 07:25 hat Pavel Dovgalyuk geschrieben: > > > From: Kevin Wolf [mailto:kwolf@redhat.com] > > > Am 15.02.2016 um 15:24 hat Pavel Dovgalyuk geschrieben: > > > > > From: Kevin Wolf [mailto:kwolf@redhat.com] > > > > > > Here is the current version of blkreplay functions: > > > > > > > > > > > > static int coroutine_fn blkreplay_co_readv(BlockDriverState *bs, > > > > > > int64_t sector_num, int nb_sectors, QEMUIOVector *qiov) > > > > > > { > > > > > > uint32_t reqid = request_id++; > > > > > > Request *req; > > > > > > req = block_request_insert(reqid, bs, qemu_coroutine_self()); > > > > > > bdrv_co_readv(bs->file->bs, sector_num, nb_sectors, qiov); > > > > > > > > > > > > if (replay_mode == REPLAY_MODE_RECORD) { > > > > > > replay_save_block_event(reqid); > > > > > > } else { > > > > > > assert(replay_mode == REPLAY_MODE_PLAY); > > > > > > qemu_coroutine_yield(); > > > > > > } > > > > > > block_request_remove(req); > > > > > > > > > > > > return 0; > > > > > > } > > > > > > > > > > > > void replay_run_block_event(uint32_t id) > > > > > > { > > > > > > Request *req; > > > > > > if (replay_mode == REPLAY_MODE_PLAY) { > > > > > > while (!(req = block_request_find(id))) { > > > > > > //aio_poll(bdrv_get_aio_context(req->bs), true); > > > > > > usleep(1); > > > > > > } > > > > > > > > > > How is this loop supposed to make any progress? > > > > > > > > This loop does not supposed to make any progress. It waits until block_request_insert > > > > call is added to the queue. > > > > > > Yes. And who is supposed to add something to the queue? You are looping > > > and doing nothing, so no other code gets to run in this thread. It's > > > only aio_poll() that would run event handlers that could eventually add > > > something to the list. > > > > blkreplay_co_readv adds request to the queue. > > I don't see blkreplay_co_readv() called in this loop without aio_poll(). Is this mandatory? I thought that version which you proposed has a race condition at the following lines when AIO request is operated by a worker thread. Is this correct? Coroutine Replay bool *done = req_replayed_list_get(reqid) // NULL co = req_completed_list_get(e.reqid); // NULL req_completed_list_insert(reqid, qemu_coroutine_self()); qemu_coroutine_yield(); bool done = false; req_replayed_list_insert(reqid, &done); while (!done) { // nobody will change done anymore aio_poll(); } > > You are right, but I found kind of fundamental problem here. > > There are two possible approaches for replaying coroutine events: > > > > 1. Waiting with aio_poll in replay_run_block_event. > > In this case replay cannot be made, because of recursive mutex lock: > > aio_poll -> qemu_clock_get_ns -> <lock replay mutex> -> > > replay_run_block_event -> aio_poll -> qemu_clock_get_ns -> <lock replay mutex> -> > > <assert failed> > > I.e. we have recursive aio_poll function calls that lead to recursive replay calls > > and to recursive mutex lock. > > That's what you get for replaying stuff in qemu_clock_get_ns() when it > should really be replayed in the hardware emulation. I'm not sure that it is possible for RTC. Host clock should run even if VM is stopped. If we emulate host clock, its behavior should be somehow synchronized with the host. > This is a problem you would also have had with your original patch, as > there are some places in the block layer that use aio_poll() internally. > Using the blkreplay driver only made you find the bug earlier. Not exactly. 
Host clock replay will use the cached clock value if there is no such event in the log.
The problem I observe here is with asynchronous block events that should be processed
immediately as they are read from the log.

> But... can you detail how we get from aio_poll() to qemu_clock_get_ns()?
> I don't see it directly calling that function, so it might just be a
> callback to some guest device that is running now.

aio_poll (win32) has the following line:

    timeout = blocking && !have_select_revents
        ? qemu_timeout_ns_to_ms(aio_compute_timeout(ctx)) : 0;

The POSIX version also uses that function. aio_compute_timeout() requests time
from all existing clocks by calling timerlistgroup_deadline_ns().

Pavel Dovgalyuk
Am 16.02.2016 um 12:20 hat Pavel Dovgalyuk geschrieben: > > From: Kevin Wolf [mailto:kwolf@redhat.com] > > Am 16.02.2016 um 07:25 hat Pavel Dovgalyuk geschrieben: > > > > From: Kevin Wolf [mailto:kwolf@redhat.com] > > > > Am 15.02.2016 um 15:24 hat Pavel Dovgalyuk geschrieben: > > > > > > From: Kevin Wolf [mailto:kwolf@redhat.com] > > > > > > > Here is the current version of blkreplay functions: > > > > > > > > > > > > > > static int coroutine_fn blkreplay_co_readv(BlockDriverState *bs, > > > > > > > int64_t sector_num, int nb_sectors, QEMUIOVector *qiov) > > > > > > > { > > > > > > > uint32_t reqid = request_id++; > > > > > > > Request *req; > > > > > > > req = block_request_insert(reqid, bs, qemu_coroutine_self()); > > > > > > > bdrv_co_readv(bs->file->bs, sector_num, nb_sectors, qiov); > > > > > > > > > > > > > > if (replay_mode == REPLAY_MODE_RECORD) { > > > > > > > replay_save_block_event(reqid); > > > > > > > } else { > > > > > > > assert(replay_mode == REPLAY_MODE_PLAY); > > > > > > > qemu_coroutine_yield(); > > > > > > > } > > > > > > > block_request_remove(req); > > > > > > > > > > > > > > return 0; > > > > > > > } > > > > > > > > > > > > > > void replay_run_block_event(uint32_t id) > > > > > > > { > > > > > > > Request *req; > > > > > > > if (replay_mode == REPLAY_MODE_PLAY) { > > > > > > > while (!(req = block_request_find(id))) { > > > > > > > //aio_poll(bdrv_get_aio_context(req->bs), true); > > > > > > > usleep(1); > > > > > > > } > > > > > > > > > > > > How is this loop supposed to make any progress? > > > > > > > > > > This loop does not supposed to make any progress. It waits until block_request_insert > > > > > call is added to the queue. > > > > > > > > Yes. And who is supposed to add something to the queue? You are looping > > > > and doing nothing, so no other code gets to run in this thread. It's > > > > only aio_poll() that would run event handlers that could eventually add > > > > something to the list. > > > > > > blkreplay_co_readv adds request to the queue. > > > > I don't see blkreplay_co_readv() called in this loop without aio_poll(). > > Is this mandatory? If you don't want it to be an endless loop, someone needs to make the value of the loop condition change... > I thought that version which you proposed has a race condition at the following lines > when AIO request is operated by a worker thread. Is this correct? Unfortunately this ended up linewrapped, trying to restore what you meant: > Coroutine Replay > bool *done = req_replayed_list_get(reqid) // NULL > co = req_completed_list_get(e.reqid); // NULL There was no yield, this context switch is impossible to happen. Same for the switch back. > req_completed_list_insert(reqid, qemu_coroutine_self()); > qemu_coroutine_yield(); This is the point at which a context switch happens. The only other point in my code is the qemu_coroutine_enter() in the other function. > bool done = false; > req_replayed_list_insert(reqid, &done); > while (!done) { // nobody will change done anymore > aio_poll(); > } > > > > > You are right, but I found kind of fundamental problem here. > > > There are two possible approaches for replaying coroutine events: > > > > > > 1. Waiting with aio_poll in replay_run_block_event. > > > In this case replay cannot be made, because of recursive mutex lock: > > > aio_poll -> qemu_clock_get_ns -> <lock replay mutex> -> > > > replay_run_block_event -> aio_poll -> qemu_clock_get_ns -> <lock replay mutex> -> > > > <assert failed> > > > I.e. 
we have recursive aio_poll function calls that lead to recursive replay calls > > > and to recursive mutex lock. > > > > That's what you get for replaying stuff in qemu_clock_get_ns() when it > > should really be replayed in the hardware emulation. > > I'm not sure that it is possible for RTC. Host clock should run even if VM is stopped. > If we emulate host clock, its behavior should be somehow synchronized with the host. I think the RTC behaviour should be replayed, whereas the host clock should be the right one. Or maybe it would actually make most sense to forbid clock=host for record/replay, because in the replay you don't get the host clock anyway, but just saved values. > > This is a problem you would also have had with your original patch, as > > there are some places in the block layer that use aio_poll() internally. > > Using the blkreplay driver only made you find the bug earlier. > > Not exactly. Host clock replay will use cached clock value if there is no such event in the log. > The problem I observe here is with asynchronous block events that should be processed > immediately as they read from log. Doesn't your earlier patch do the same? I seem to remember that it calls bdrv_drain_all(), which is essentially the same as calling aio_poll() in a loop, except that its loop condition is slightly different. > > But... can you detail how we get from aio_poll() to qemu_clock_get_ns()? > > I don't see it directly calling that function, so it might just be a > > callback to some guest device that is running now. > > aio_poll (win32) has the following line: > > timeout = blocking && !have_select_revents > ? qemu_timeout_ns_to_ms(aio_compute_timeout(ctx)) : 0; > > POSIX version also uses that function. aio_compute_timeout() requests time from > all existing clocks by calling timerlistgroup_deadline_ns(). I see, thanks. Kevin
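Coming back to the clock=host remark: forbidding the combination would be a small check at option parsing time. A sketch only; the placement in vl.c's configure_rtc() and the message wording are assumptions, not part of the posted series:

    /* Hypothetical check: reject -rtc clock=host when record/replay is
     * enabled, since the replayed run never sees the real host clock. */
    if (replay_mode != REPLAY_MODE_NONE && rtc_clock == QEMU_CLOCK_HOST) {
        error_report("-rtc clock=host is not compatible with record/replay");
        exit(1);
    }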
> From: Kevin Wolf [mailto:kwolf@redhat.com]
> Am 16.02.2016 um 12:20 hat Pavel Dovgalyuk geschrieben:
> > Coroutine Replay
> > bool *done = req_replayed_list_get(reqid) // NULL
> > co =
> req_completed_list_get(e.reqid); // NULL
>
> There was no yield, this context switch is impossible to happen. Same
> for the switch back.
>
> > req_completed_list_insert(reqid, qemu_coroutine_self());
> > qemu_coroutine_yield();
>
> This is the point at which a context switch happens. The only other
> point in my code is the qemu_coroutine_enter() in the other function.

I've fixed the aio_poll problem by disabling the mutex lock for the
replay_run_block_event() execution. Now the virtual machine deterministically
runs 4e8 instructions of Windows XP booting.

But then one non-deterministic event happens.
The callback after finishing a coroutine may be called from different contexts.
The apic_update_irq() function behaves differently when called from the vcpu
and io threads. In one case it sets CPU_INTERRUPT_POLL and in the other
nothing happens. Therefore execution becomes non-deterministic.

In the previous version of the patch I solved this problem by linking block
events to the execution checkpoints. The IO thread has its own checkpoints and
the vcpu its own. Therefore apic callbacks are always called from the same
thread in replay as in the recording phase.

Pavel Dovgalyuk
> From: Pavel Dovgalyuk [mailto:dovgaluk@ispras.ru]
> > From: Kevin Wolf [mailto:kwolf@redhat.com]
> > Am 16.02.2016 um 12:20 hat Pavel Dovgalyuk geschrieben:
> > > Coroutine Replay
> > > bool *done = req_replayed_list_get(reqid) // NULL
> > > co =
> > req_completed_list_get(e.reqid); // NULL
> >
> > There was no yield, this context switch is impossible to happen. Same
> > for the switch back.
> >
> > > req_completed_list_insert(reqid, qemu_coroutine_self());
> > > qemu_coroutine_yield();
> >
> > This is the point at which a context switch happens. The only other
> > point in my code is the qemu_coroutine_enter() in the other function.
>
> I've fixed aio_poll problem by disabling mutex lock for the replay_run_block_event()
> execution. Now virtual machine deterministically runs 4e8 instructions of Windows XP booting.
>
> But then one non-deterministic event happens.
> Callback after finishing coroutine may be called from different contexts.
> apic_update_irq() function behaves differently being called from vcpu and io threads.
> In one case it sets CPU_INTERRUPT_POLL and in other - nothing happens.

Kevin, do you have any ideas how to fix this issue?
This happens because coroutines may be assigned to different threads.
Maybe there is some way of making this assignment more deterministic?

> Therefore execution becomes non-deterministic.
> In previous version of the patch I solved this problem by linking block events to the
> execution checkpoints. IO thread have its own checkpoints and vcpu - its own.
> Therefore apic callbacks are always called from the same thread in replay as in recording
> phase.

Pavel Dovgalyuk
Am 20.02.2016 um 08:11 hat Pavel Dovgalyuk geschrieben: > > From: Pavel Dovgalyuk [mailto:dovgaluk@ispras.ru] > > > From: Kevin Wolf [mailto:kwolf@redhat.com] > > > Am 16.02.2016 um 12:20 hat Pavel Dovgalyuk geschrieben: > > > > Coroutine Replay > > > > bool *done = req_replayed_list_get(reqid) // NULL > > > > co = > > > req_completed_list_get(e.reqid); // NULL > > > > > > There was no yield, this context switch is impossible to happen. Same > > > for the switch back. > > > > > > > req_completed_list_insert(reqid, qemu_coroutine_self()); > > > > qemu_coroutine_yield(); > > > > > > This is the point at which a context switch happens. The only other > > > point in my code is the qemu_coroutine_enter() in the other function. > > > > I've fixed aio_poll problem by disabling mutex lock for the replay_run_block_event() > > execution. Now virtual machine deterministically runs 4e8 instructions of Windows XP booting. Are you sure that the lock was unnecessary? Solving deadlocks by removing the lock is a rather adventurous method. I assume that you're still replaying events from low-level functions like qemu_clock_get_ns(). So even if you get rid of the hangs, the result is probably not quite right. I'm afraid this is going in a direction where my comments can't be more constructive than a simple "you're doing it wrong". > > But then one non-deterministic event happens. > > Callback after finishing coroutine may be called from different contexts. How does this happen? I'm not aware of callbacks being processed by any thread other than the I/O thread for that specific block device (unless you use dataplane, this is the main loop thread). > > apic_update_irq() function behaves differently being called from vcpu and io threads. > > In one case it sets CPU_INTERRUPT_POLL and in other - nothing happens. > > Kevin, do you have some ideas how to fix this issue? > This happens because of coroutines may be assigned to different threads. > Maybe there is some way of making this assignment more deterministic? Coroutines aren't randomly assigned to threads, but threads actively enter coroutines. To my knowledge this happens only when starting a request (either vcpu or I/O thread; consistent per device) or by a callback when some event happens (only I/O thread). I can't see any non-determinism here. Kevin > > Therefore execution becomes non-deterministic. > > In previous version of the patch I solved this problem by linking block events to the > > execution checkpoints. IO thread have its own checkpoints and vcpu - its own. > > Therefore apic callbacks are always called from the same thread in replay as in recording > > phase. > > Pavel Dovgalyuk >
> From: Kevin Wolf [mailto:kwolf@redhat.com]
> > > Coroutines aren't randomly assigned to threads, but threads actively
> > > enter coroutines. To my knowledge this happens only when starting a
> > > request (either vcpu or I/O thread; consistent per device) or by a
> > > callback when some event happens (only I/O thread). I can't see any
> > > non-determinism here.

The behavior of coroutines looks strange to me.
Consider the code below (the co_readv function of the replay driver).
In record mode it somehow changes the thread it is assigned to.
Code at point A is executed in the CPU thread and code at point B in some other thread.
May this happen because this coroutine yields somewhere and its execution is
restored by aio_poll, which is called from the iothread?
In this case the event finishing callback cannot be executed deterministically
(always in the CPU thread or always in the IO thread).

static int coroutine_fn blkreplay_co_readv(BlockDriverState *bs,
    int64_t sector_num, int nb_sectors, QEMUIOVector *qiov)
{
    BDRVBlkreplayState *s = bs->opaque;
    uint32_t reqid = request_id++;
    Request *req;
    // A
    bdrv_co_readv(bs->file->bs, sector_num, nb_sectors, qiov);

    if (replay_mode == REPLAY_MODE_RECORD) {
        replay_save_block_event(reqid);
    } else {
        assert(replay_mode == REPLAY_MODE_PLAY);
        if (reqid == current_request) {
            current_finished = true;
        } else {
            req = block_request_insert(reqid, bs, qemu_coroutine_self());
            qemu_coroutine_yield();
            block_request_remove(req);
        }
    }
    // B
    return 0;
}

Pavel Dovgalyuk
Am 24.02.2016 um 12:59 hat Pavel Dovgalyuk geschrieben: > > From: Kevin Wolf [mailto:kwolf@redhat.com] > > Am 20.02.2016 um 08:11 hat Pavel Dovgalyuk geschrieben: > > > > From: Pavel Dovgalyuk [mailto:dovgaluk@ispras.ru] > > > > > From: Kevin Wolf [mailto:kwolf@redhat.com] > > > > > Am 16.02.2016 um 12:20 hat Pavel Dovgalyuk geschrieben: > > > > > > Coroutine Replay > > > > > > bool *done = req_replayed_list_get(reqid) // NULL > > > > > > co = > > > > > req_completed_list_get(e.reqid); // NULL > > > > > > > > > > There was no yield, this context switch is impossible to happen. Same > > > > > for the switch back. > > > > > > > > > > > req_completed_list_insert(reqid, qemu_coroutine_self()); > > > > > > qemu_coroutine_yield(); > > > > > > > > > > This is the point at which a context switch happens. The only other > > > > > point in my code is the qemu_coroutine_enter() in the other function. > > > > > > > > I've fixed aio_poll problem by disabling mutex lock for the replay_run_block_event() > > > > execution. Now virtual machine deterministically runs 4e8 instructions of Windows XP > > booting. > > > > But then one non-deterministic event happens. > > > > Callback after finishing coroutine may be called from different contexts. > > > > How does this happen? I'm not aware of callbacks being processed by any > > thread other than the I/O thread for that specific block device (unless > > you use dataplane, this is the main loop thread). > > > > > > apic_update_irq() function behaves differently being called from vcpu and io threads. > > > > In one case it sets CPU_INTERRUPT_POLL and in other - nothing happens. > > > > > > Kevin, do you have some ideas how to fix this issue? > > > This happens because of coroutines may be assigned to different threads. > > > Maybe there is some way of making this assignment more deterministic? > > > > Coroutines aren't randomly assigned to threads, but threads actively > > enter coroutines. To my knowledge this happens only when starting a > > request (either vcpu or I/O thread; consistent per device) or by a > > callback when some event happens (only I/O thread). I can't see any > > non-determinism here. > > Behavior of coroutines looks strange for me. > Consider the code below (co_readv function of the replay driver). > In record mode it somehow changes the thread it assigned to. > Code in point A is executed in CPU thread and code in point B - in some other thread. > May this happen because this coroutine yields somewhere and its execution is restored > by aio_poll, which is called from iothread? > In this case event finishing callback cannot be executed deterministically > (always in CPU thread or always in IO thread). > > static int coroutine_fn blkreplay_co_readv(BlockDriverState *bs, > int64_t sector_num, int nb_sectors, QEMUIOVector *qiov) > { > BDRVBlkreplayState *s = bs->opaque; > uint32_t reqid = request_id++; > Request *req; > // A > bdrv_co_readv(bs->file->bs, sector_num, nb_sectors, qiov); > > if (replay_mode == REPLAY_MODE_RECORD) { > replay_save_block_event(reqid); > } else { > assert(replay_mode == REPLAY_MODE_PLAY); > if (reqid == current_request) { > current_finished = true; > } else { > req = block_request_insert(reqid, bs, qemu_coroutine_self()); > qemu_coroutine_yield(); > block_request_remove(req); > } > } > // B > return 0; > } Yes, I guess this can happen. As I described above, the coroutine can be entered from a vcpu thread initially. After yielding for the first time, it is resumed from the I/O thread. 
So if there are paths where the coroutine never yields, the coroutine
completes in the original vcpu thread. (It's not the common case that
bdrv_co_readv() doesn't yield, but it happens e.g. with unallocated sectors
in qcow2.)

If this is a problem for you, you need to force the coroutine into the
I/O thread. You can do that by scheduling a BH, then yielding, and then
letting the BH reenter the coroutine.

Kevin
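A minimal sketch of that BH trick, reusing the Request/queue approach from earlier in the thread (the helper names here are only illustrative, not from the posted patch):

    /* Runs in the I/O thread: reenter the parked coroutine there. */
    static void blkreplay_bh_cb(void *opaque)
    {
        Coroutine *co = opaque;
        qemu_coroutine_enter(co, NULL);
    }

    /* Called from the coroutine: hop over to the BDS's AioContext (the
     * main loop thread without dataplane) before continuing. */
    static void coroutine_fn blkreplay_move_to_iothread(BlockDriverState *bs)
    {
        QEMUBH *bh = aio_bh_new(bdrv_get_aio_context(bs),
                                blkreplay_bh_cb, qemu_coroutine_self());

        qemu_bh_schedule(bh);
        qemu_coroutine_yield();
        /* We only get back here once the BH has run, i.e. in the I/O thread. */
        qemu_bh_delete(bh);
    }

Calling such a helper at the top of blkreplay_co_readv() would make both point A and point B run in the I/O thread, whichever thread issued the request.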
> From: Kevin Wolf [mailto:kwolf@redhat.com] > > > Coroutines aren't randomly assigned to threads, but threads actively > > > enter coroutines. To my knowledge this happens only when starting a > > > request (either vcpu or I/O thread; consistent per device) or by a > > > callback when some event happens (only I/O thread). I can't see any > > > non-determinism here. > > > > Behavior of coroutines looks strange for me. > > Consider the code below (co_readv function of the replay driver). > > In record mode it somehow changes the thread it assigned to. > > Code in point A is executed in CPU thread and code in point B - in some other thread. > > May this happen because this coroutine yields somewhere and its execution is restored > > by aio_poll, which is called from iothread? > > In this case event finishing callback cannot be executed deterministically > > (always in CPU thread or always in IO thread). > > > > static int coroutine_fn blkreplay_co_readv(BlockDriverState *bs, > > int64_t sector_num, int nb_sectors, QEMUIOVector *qiov) > > { > > BDRVBlkreplayState *s = bs->opaque; > > uint32_t reqid = request_id++; > > Request *req; > > // A > > bdrv_co_readv(bs->file->bs, sector_num, nb_sectors, qiov); > > > > if (replay_mode == REPLAY_MODE_RECORD) { > > replay_save_block_event(reqid); > > } else { > > assert(replay_mode == REPLAY_MODE_PLAY); > > if (reqid == current_request) { > > current_finished = true; > > } else { > > req = block_request_insert(reqid, bs, qemu_coroutine_self()); > > qemu_coroutine_yield(); > > block_request_remove(req); > > } > > } > > // B > > return 0; > > } > > Yes, I guess this can happen. As I described above, the coroutine can be > entered from a vcpu thread initially. After yielding for the first time, > it is resumed from the I/O thread. So if there are paths where the > coroutine never yields, the coroutine completes in the original vcpu > thread. (It's not the common case that bdrv_co_readv() doesn't yield, > but it happens e.g. with unallocated sectors in qcow2.) > > If this is a problem for you, you need to force the coroutine into the > I/O thread. You can do that by scheduling a BH, then yield, and then let > the BH reenter the coroutine. Thanks, this approach seems to work. I got rid of replay_run_block_event, because BH basically does the same job. There is one problem with flush event - callbacks for flush are called for all layers and I couldn't synchronize them correctly yet. I'll probably have to add new callback to block driver, which handles flush request for the whole stack of the drivers. Pavel Dovgalyuk
Am 25.02.2016 um 10:06 hat Pavel Dovgalyuk geschrieben: > > From: Kevin Wolf [mailto:kwolf@redhat.com] > > > > Coroutines aren't randomly assigned to threads, but threads actively > > > > enter coroutines. To my knowledge this happens only when starting a > > > > request (either vcpu or I/O thread; consistent per device) or by a > > > > callback when some event happens (only I/O thread). I can't see any > > > > non-determinism here. > > > > > > Behavior of coroutines looks strange for me. > > > Consider the code below (co_readv function of the replay driver). > > > In record mode it somehow changes the thread it assigned to. > > > Code in point A is executed in CPU thread and code in point B - in some other thread. > > > May this happen because this coroutine yields somewhere and its execution is restored > > > by aio_poll, which is called from iothread? > > > In this case event finishing callback cannot be executed deterministically > > > (always in CPU thread or always in IO thread). > > > > > > static int coroutine_fn blkreplay_co_readv(BlockDriverState *bs, > > > int64_t sector_num, int nb_sectors, QEMUIOVector *qiov) > > > { > > > BDRVBlkreplayState *s = bs->opaque; > > > uint32_t reqid = request_id++; > > > Request *req; > > > // A > > > bdrv_co_readv(bs->file->bs, sector_num, nb_sectors, qiov); > > > > > > if (replay_mode == REPLAY_MODE_RECORD) { > > > replay_save_block_event(reqid); > > > } else { > > > assert(replay_mode == REPLAY_MODE_PLAY); > > > if (reqid == current_request) { > > > current_finished = true; > > > } else { > > > req = block_request_insert(reqid, bs, qemu_coroutine_self()); > > > qemu_coroutine_yield(); > > > block_request_remove(req); > > > } > > > } > > > // B > > > return 0; > > > } > > > > Yes, I guess this can happen. As I described above, the coroutine can be > > entered from a vcpu thread initially. After yielding for the first time, > > it is resumed from the I/O thread. So if there are paths where the > > coroutine never yields, the coroutine completes in the original vcpu > > thread. (It's not the common case that bdrv_co_readv() doesn't yield, > > but it happens e.g. with unallocated sectors in qcow2.) > > > > If this is a problem for you, you need to force the coroutine into the > > I/O thread. You can do that by scheduling a BH, then yield, and then let > > the BH reenter the coroutine. > > Thanks, this approach seems to work. I got rid of replay_run_block_event, > because BH basically does the same job. > > There is one problem with flush event - callbacks for flush are called for > all layers and I couldn't synchronize them correctly yet. > I'll probably have to add new callback to block driver, which handles > flush request for the whole stack of the drivers. Flushes should be treated more or less the same a writes, I think. Kevin
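Treating flush that way would mirror blkreplay_co_readv; a sketch only (blkreplay_co_flush is not in the posted code, and the parking uses the same queue/yield dance as for reads):

    static int coroutine_fn blkreplay_co_flush(BlockDriverState *bs)
    {
        uint32_t reqid = request_id++;
        int ret = bdrv_co_flush(bs->file->bs);

        if (replay_mode == REPLAY_MODE_RECORD) {
            replay_save_block_event(reqid);
        } else {
            assert(replay_mode == REPLAY_MODE_PLAY);
            /* park until the corresponding event is read from the log,
             * exactly as blkreplay_co_readv() does */
            Request *req = block_request_insert(reqid, bs, qemu_coroutine_self());
            qemu_coroutine_yield();
            block_request_remove(req);
        }

        return ret;
    }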
diff --git a/block/block-backend.c b/block/block-backend.c
index f41d326..d8820b9 100644
--- a/block/block-backend.c
+++ b/block/block-backend.c
@@ -16,6 +16,7 @@
 #include "block/throttle-groups.h"
 #include "sysemu/blockdev.h"
 #include "sysemu/sysemu.h"
+#include "sysemu/replay.h"
 #include "qapi-event.h"
 
 /* Number of coroutines to reserve per attached device model */
@@ -655,14 +656,14 @@ BlockAIOCB *blk_abort_aio_request(BlockBackend *blk,
     bh = aio_bh_new(blk_get_aio_context(blk), error_callback_bh, acb);
     acb->bh = bh;
-    qemu_bh_schedule(bh);
+    replay_bh_schedule_event(bh);
 
     return &acb->common;
 }
 
-BlockAIOCB *blk_aio_write_zeroes(BlockBackend *blk, int64_t sector_num,
-                                 int nb_sectors, BdrvRequestFlags flags,
-                                 BlockCompletionFunc *cb, void *opaque)
+BlockAIOCB *blk_aio_write_zeroes_impl(BlockBackend *blk, int64_t sector_num,
+                                      int nb_sectors, BdrvRequestFlags flags,
+                                      BlockCompletionFunc *cb, void *opaque)
 {
     int ret = blk_check_request(blk, sector_num, nb_sectors);
     if (ret < 0) {
@@ -673,6 +674,13 @@ BlockAIOCB *blk_aio_write_zeroes(BlockBackend *blk, int64_t sector_num,
                              cb, opaque);
 }
 
+BlockAIOCB *blk_aio_write_zeroes(BlockBackend *blk, int64_t sector_num,
+                                 int nb_sectors, BdrvRequestFlags flags,
+                                 BlockCompletionFunc *cb, void *opaque)
+{
+    return replay_aio_write_zeroes(blk, sector_num, nb_sectors, flags, cb, opaque);
+}
+
 int blk_pread(BlockBackend *blk, int64_t offset, void *buf, int count)
 {
     int ret = blk_check_byte_request(blk, offset, count);
@@ -720,7 +728,7 @@ int64_t blk_nb_sectors(BlockBackend *blk)
     return bdrv_nb_sectors(blk->bs);
 }
 
-BlockAIOCB *blk_aio_readv(BlockBackend *blk, int64_t sector_num,
+BlockAIOCB *blk_aio_readv_impl(BlockBackend *blk, int64_t sector_num,
                           QEMUIOVector *iov, int nb_sectors,
                           BlockCompletionFunc *cb, void *opaque)
 {
@@ -732,9 +740,16 @@ BlockAIOCB *blk_aio_readv(BlockBackend *blk, int64_t sector_num,
     return bdrv_aio_readv(blk->bs, sector_num, iov, nb_sectors, cb, opaque);
 }
 
-BlockAIOCB *blk_aio_writev(BlockBackend *blk, int64_t sector_num,
-                           QEMUIOVector *iov, int nb_sectors,
-                           BlockCompletionFunc *cb, void *opaque)
+BlockAIOCB *blk_aio_readv(BlockBackend *blk, int64_t sector_num,
+                          QEMUIOVector *iov, int nb_sectors,
+                          BlockCompletionFunc *cb, void *opaque)
+{
+    return replay_aio_readv(blk, sector_num, iov, nb_sectors, cb, opaque);
+}
+
+BlockAIOCB *blk_aio_writev_impl(BlockBackend *blk, int64_t sector_num,
+                                QEMUIOVector *iov, int nb_sectors,
+                                BlockCompletionFunc *cb, void *opaque)
 {
     int ret = blk_check_request(blk, sector_num, nb_sectors);
     if (ret < 0) {
@@ -744,8 +759,15 @@ BlockAIOCB *blk_aio_writev(BlockBackend *blk, int64_t sector_num,
     return bdrv_aio_writev(blk->bs, sector_num, iov, nb_sectors, cb, opaque);
 }
 
-BlockAIOCB *blk_aio_flush(BlockBackend *blk,
-                          BlockCompletionFunc *cb, void *opaque)
+BlockAIOCB *blk_aio_writev(BlockBackend *blk, int64_t sector_num,
+                           QEMUIOVector *iov, int nb_sectors,
+                           BlockCompletionFunc *cb, void *opaque)
+{
+    return replay_aio_writev(blk, sector_num, iov, nb_sectors, cb, opaque);
+}
+
+BlockAIOCB *blk_aio_flush_impl(BlockBackend *blk,
+                               BlockCompletionFunc *cb, void *opaque)
 {
     if (!blk_is_available(blk)) {
         return blk_abort_aio_request(blk, cb, opaque, -ENOMEDIUM);
@@ -754,9 +776,15 @@ BlockAIOCB *blk_aio_flush(BlockBackend *blk,
     return bdrv_aio_flush(blk->bs, cb, opaque);
 }
 
-BlockAIOCB *blk_aio_discard(BlockBackend *blk,
-                            int64_t sector_num, int nb_sectors,
-                            BlockCompletionFunc *cb, void *opaque)
+BlockAIOCB *blk_aio_flush(BlockBackend *blk,
+                          BlockCompletionFunc *cb, void *opaque)
+{
+    return replay_aio_flush(blk, cb, opaque);
+}
+
+BlockAIOCB *blk_aio_discard_impl(BlockBackend *blk,
+                                 int64_t sector_num, int nb_sectors,
+                                 BlockCompletionFunc *cb, void *opaque)
 {
     int ret = blk_check_request(blk, sector_num, nb_sectors);
     if (ret < 0) {
@@ -766,6 +794,13 @@ BlockAIOCB *blk_aio_discard(BlockBackend *blk,
     return bdrv_aio_discard(blk->bs, sector_num, nb_sectors, cb, opaque);
 }
 
+BlockAIOCB *blk_aio_discard(BlockBackend *blk,
+                            int64_t sector_num, int nb_sectors,
+                            BlockCompletionFunc *cb, void *opaque)
+{
+    return replay_aio_discard(blk, sector_num, nb_sectors, cb, opaque);
+}
+
 void blk_aio_cancel(BlockAIOCB *acb)
 {
     bdrv_aio_cancel(acb);
@@ -799,8 +834,8 @@ int blk_ioctl(BlockBackend *blk, unsigned long int req, void *buf)
     return bdrv_ioctl(blk->bs, req, buf);
 }
 
-BlockAIOCB *blk_aio_ioctl(BlockBackend *blk, unsigned long int req, void *buf,
-                          BlockCompletionFunc *cb, void *opaque)
+BlockAIOCB *blk_aio_ioctl_impl(BlockBackend *blk, unsigned long int req, void *buf,
+                               BlockCompletionFunc *cb, void *opaque)
 {
     if (!blk_is_available(blk)) {
         return blk_abort_aio_request(blk, cb, opaque, -ENOMEDIUM);
@@ -809,6 +844,12 @@ BlockAIOCB *blk_aio_ioctl(BlockBackend *blk, unsigned long int req, void *buf,
     return bdrv_aio_ioctl(blk->bs, req, buf, cb, opaque);
 }
 
+BlockAIOCB *blk_aio_ioctl(BlockBackend *blk, unsigned long int req, void *buf,
+                          BlockCompletionFunc *cb, void *opaque)
+{
+    return replay_aio_ioctl(blk, req, buf, cb, opaque);
+}
+
 int blk_co_discard(BlockBackend *blk, int64_t sector_num, int nb_sectors)
 {
     int ret = blk_check_request(blk, sector_num, nb_sectors);
diff --git a/include/sysemu/block-backend.h b/include/sysemu/block-backend.h
index dc24476..cf05d56 100644
--- a/include/sysemu/block-backend.h
+++ b/include/sysemu/block-backend.h
@@ -100,6 +100,9 @@ int blk_write_zeroes(BlockBackend *blk, int64_t sector_num,
 BlockAIOCB *blk_aio_write_zeroes(BlockBackend *blk, int64_t sector_num,
                                  int nb_sectors, BdrvRequestFlags flags,
                                  BlockCompletionFunc *cb, void *opaque);
+BlockAIOCB *blk_aio_write_zeroes_impl(BlockBackend *blk, int64_t sector_num,
+                                      int nb_sectors, BdrvRequestFlags flags,
+                                      BlockCompletionFunc *cb, void *opaque);
 int blk_pread(BlockBackend *blk, int64_t offset, void *buf, int count);
 int blk_pwrite(BlockBackend *blk, int64_t offset, const void *buf, int count);
 int64_t blk_getlength(BlockBackend *blk);
@@ -108,20 +111,33 @@ int64_t blk_nb_sectors(BlockBackend *blk);
 BlockAIOCB *blk_aio_readv(BlockBackend *blk, int64_t sector_num,
                           QEMUIOVector *iov, int nb_sectors,
                           BlockCompletionFunc *cb, void *opaque);
+BlockAIOCB *blk_aio_readv_impl(BlockBackend *blk, int64_t sector_num,
+                               QEMUIOVector *iov, int nb_sectors,
+                               BlockCompletionFunc *cb, void *opaque);
 BlockAIOCB *blk_aio_writev(BlockBackend *blk, int64_t sector_num,
                            QEMUIOVector *iov, int nb_sectors,
                            BlockCompletionFunc *cb, void *opaque);
+BlockAIOCB *blk_aio_writev_impl(BlockBackend *blk, int64_t sector_num,
+                                QEMUIOVector *iov, int nb_sectors,
+                                BlockCompletionFunc *cb, void *opaque);
 BlockAIOCB *blk_aio_flush(BlockBackend *blk,
                           BlockCompletionFunc *cb, void *opaque);
+BlockAIOCB *blk_aio_flush_impl(BlockBackend *blk,
+                               BlockCompletionFunc *cb, void *opaque);
 BlockAIOCB *blk_aio_discard(BlockBackend *blk,
                             int64_t sector_num, int nb_sectors,
                             BlockCompletionFunc *cb, void *opaque);
+BlockAIOCB *blk_aio_discard_impl(BlockBackend *blk,
+                                 int64_t sector_num, int nb_sectors,
+                                 BlockCompletionFunc *cb, void *opaque);
 void blk_aio_cancel(BlockAIOCB *acb);
 void blk_aio_cancel_async(BlockAIOCB *acb);
 int blk_aio_multiwrite(BlockBackend *blk, BlockRequest *reqs, int num_reqs);
 int blk_ioctl(BlockBackend *blk, unsigned long int req, void *buf);
 BlockAIOCB *blk_aio_ioctl(BlockBackend *blk, unsigned long int req, void *buf,
                           BlockCompletionFunc *cb, void *opaque);
+BlockAIOCB *blk_aio_ioctl_impl(BlockBackend *blk, unsigned long int req, void *buf,
+                               BlockCompletionFunc *cb, void *opaque);
 int blk_co_discard(BlockBackend *blk, int64_t sector_num, int nb_sectors);
 int blk_co_flush(BlockBackend *blk);
 int blk_flush(BlockBackend *blk);
diff --git a/include/sysemu/replay.h b/include/sysemu/replay.h
index c879231..c06e355 100644
--- a/include/sysemu/replay.h
+++ b/include/sysemu/replay.h
@@ -17,6 +17,8 @@
 #include "qapi-types.h"
 #include "qapi/error.h"
 #include "qemu/typedefs.h"
+#include "block/aio.h"
+#include "block/block.h"
 
 /* replay clock kinds */
 enum ReplayClockKind {
@@ -125,6 +127,31 @@ void replay_register_char_driver(struct CharDriverState *chr);
 /*! Saves write to char device event to the log */
 void replay_chr_be_write(struct CharDriverState *s, uint8_t *buf, int len);
 
+/* Block devices */
+
+/*! Asynchronous read function. Adds read request to the queue. */
+BlockAIOCB *replay_aio_readv(BlockBackend *blk, int64_t sector_num,
+                             QEMUIOVector *iov, int nb_sectors,
+                             BlockCompletionFunc *cb, void *opaque);
+/*! Asynchronous write function. Adds write request to the queue. */
+BlockAIOCB *replay_aio_writev(BlockBackend *blk, int64_t sector_num,
+                              QEMUIOVector *iov, int nb_sectors,
+                              BlockCompletionFunc *cb, void *opaque);
+/*! Asynchronous write zeroes function. Adds this request to the queue. */
+BlockAIOCB *replay_aio_write_zeroes(BlockBackend *blk, int64_t sector_num,
+                                    int nb_sectors, BdrvRequestFlags flags,
+                                    BlockCompletionFunc *cb, void *opaque);
+/*! Asynchronous flush function. Adds flush request to the queue. */
+BlockAIOCB *replay_aio_flush(BlockBackend *blk,
+                             BlockCompletionFunc *cb, void *opaque);
+/*! Asynchronous discard function. Adds discard request to the queue. */
+BlockAIOCB *replay_aio_discard(BlockBackend *blk, int64_t sector_num,
+                               int nb_sectors, BlockCompletionFunc *cb,
+                               void *opaque);
+/*! Asynchronous ioctl function. Adds ioctl request to the queue. */
+BlockAIOCB *replay_aio_ioctl(BlockBackend *blk, unsigned long int req, void *buf,
+                             BlockCompletionFunc *cb, void *opaque);
+
 /* Other data */
 
 /*! Writes or reads integer value to/from replay log. */
diff --git a/replay/Makefile.objs b/replay/Makefile.objs
index 70e5572..050c0ea 100644
--- a/replay/Makefile.objs
+++ b/replay/Makefile.objs
@@ -4,3 +4,4 @@ common-obj-y += replay-events.o
 common-obj-y += replay-time.o
 common-obj-y += replay-input.o
 common-obj-y += replay-char.o
+common-obj-y += replay-block.o
diff --git a/replay/replay-block.c b/replay/replay-block.c
new file mode 100755
index 0000000..7fc68cf
--- /dev/null
+++ b/replay/replay-block.c
@@ -0,0 +1,202 @@
+/*
+ * replay-block.c
+ *
+ * Copyright (c) 2010-2016 Institute for System Programming
+ * of the Russian Academy of Sciences.
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ *
+ */
+#include "sysemu/replay.h"
+#include "sysemu/block-backend.h"
+#include "replay-internal.h"
+
+static void replay_aio_cancel(BlockAIOCB *acb);
+
+typedef struct ReplayAIOCB {
+    BlockAIOCB common;
+    BlockRequest req;
+    BlockBackend *blk;
+    ReplayAsyncEventKind kind;
+} ReplayAIOCB;
+
+static const AIOCBInfo replay_aiocb_info = {
+    .aiocb_size = sizeof(ReplayAIOCB),
+    .cancel_async = replay_aio_cancel
+};
+
+static ReplayAIOCB *replay_aio_create(ReplayAsyncEventKind kind,
+                                      BlockBackend *blk,
+                                      BlockCompletionFunc *cb, void *opaque)
+{
+    ReplayAIOCB *acb = g_malloc0(sizeof(*acb));
+    acb->kind = kind;
+    /* Make correct AIOCB to return to the caller */
+    acb->common.aiocb_info = &replay_aiocb_info;
+    acb->common.bs = blk_bs(blk);
+    acb->common.cb = cb;
+    acb->common.opaque = opaque;
+    acb->common.refcnt = 1;
+    /* Data to restore and pass request to the bdrv */
+    acb->blk = blk;
+    acb->req.error = -EINPROGRESS;
+    return acb;
+}
+
+BlockAIOCB *replay_aio_readv(BlockBackend *blk, int64_t sector_num,
+                             QEMUIOVector *iov, int nb_sectors,
+                             BlockCompletionFunc *cb, void *opaque)
+{
+    if (replay_mode == REPLAY_MODE_NONE) {
+        return blk_aio_readv_impl(blk, sector_num, iov, nb_sectors, cb, opaque);
+    } else {
+        ReplayAIOCB *acb = replay_aio_create(REPLAY_ASYNC_EVENT_BLOCK_READV,
+                                             blk, cb, opaque);
+        acb->req.sector = sector_num;
+        acb->req.nb_sectors = nb_sectors;
+        acb->req.qiov = iov;
+        replay_add_event(REPLAY_ASYNC_EVENT_BLOCK_READV, acb, NULL, 0);
+
+        return &acb->common;
+    }
+}
+
+void replay_event_block_readv_run(void *opaque)
+{
+    ReplayAIOCB *acb = opaque;
+    blk_aio_readv_impl(acb->blk, acb->req.sector, acb->req.qiov,
+                       acb->req.nb_sectors, acb->common.cb,
+                       acb->common.opaque);
+    blk_drain_all();
+}
+
+BlockAIOCB *replay_aio_writev(BlockBackend *blk, int64_t sector_num,
+                              QEMUIOVector *iov, int nb_sectors,
+                              BlockCompletionFunc *cb, void *opaque)
+{
+    if (replay_mode == REPLAY_MODE_NONE) {
+        return blk_aio_writev_impl(blk, sector_num, iov, nb_sectors, cb, opaque);
+    } else {
+        ReplayAIOCB *acb = replay_aio_create(REPLAY_ASYNC_EVENT_BLOCK_WRITEV,
+                                             blk, cb, opaque);
+        acb->req.sector = sector_num;
+        acb->req.nb_sectors = nb_sectors;
+        acb->req.qiov = iov;
+        replay_add_event(REPLAY_ASYNC_EVENT_BLOCK_WRITEV, acb, NULL, 0);
+
+        return &acb->common;
+    }
+}
+
+void replay_event_block_writev_run(void *opaque)
+{
+    ReplayAIOCB *acb = opaque;
+    blk_aio_writev_impl(acb->blk, acb->req.sector, acb->req.qiov,
+                        acb->req.nb_sectors, acb->common.cb,
+                        acb->common.opaque);
+    blk_drain_all();
+}
+
+BlockAIOCB *replay_aio_write_zeroes(BlockBackend *blk, int64_t sector_num,
+                                    int nb_sectors, BdrvRequestFlags flags,
+                                    BlockCompletionFunc *cb, void *opaque)
+{
+    if (replay_mode == REPLAY_MODE_NONE) {
+        return blk_aio_write_zeroes_impl(blk, sector_num, nb_sectors, flags, cb, opaque);
+    } else {
+        ReplayAIOCB *acb = replay_aio_create(REPLAY_ASYNC_EVENT_BLOCK_WRITE_ZEROES,
+                                             blk, cb, opaque);
+        acb->req.sector = sector_num;
+        acb->req.nb_sectors = nb_sectors;
+        acb->req.flags = flags;
+        replay_add_event(REPLAY_ASYNC_EVENT_BLOCK_WRITE_ZEROES, acb, NULL, 0);
+
+        return &acb->common;
+    }
+}
+
+void replay_event_block_write_zeroes_run(void *opaque)
+{
+    ReplayAIOCB *acb = opaque;
+    blk_aio_write_zeroes_impl(acb->blk, acb->req.sector, acb->req.nb_sectors,
+                              acb->req.flags, acb->common.cb,
+                              acb->common.opaque);
+    blk_drain_all();
+}
+
+BlockAIOCB *replay_aio_flush(BlockBackend *blk,
+                             BlockCompletionFunc *cb, void *opaque)
+{
+    if (replay_mode == REPLAY_MODE_NONE) {
+        return blk_aio_flush_impl(blk, cb, opaque);
+    } else {
+        ReplayAIOCB *acb = replay_aio_create(REPLAY_ASYNC_EVENT_BLOCK_FLUSH,
+                                             blk, cb, opaque);
+        replay_add_event(REPLAY_ASYNC_EVENT_BLOCK_FLUSH, acb, NULL, 0);
+
+        return &acb->common;
+    }
+}
+
+void replay_event_block_flush_run(void *opaque)
+{
+    ReplayAIOCB *acb = opaque;
+    blk_aio_flush_impl(acb->blk, acb->common.cb, acb->common.opaque);
+    blk_drain_all();
+}
+
+static void replay_aio_cancel(BlockAIOCB *acb)
+{
+    ReplayAIOCB *racb = container_of(acb, ReplayAIOCB, common);;
+    replay_remove_event(racb->kind, acb, NULL, 0);
+}
+
+BlockAIOCB *replay_aio_discard(BlockBackend *blk, int64_t sector_num,
+                               int nb_sectors, BlockCompletionFunc *cb,
+                               void *opaque)
+{
+    if (replay_mode == REPLAY_MODE_NONE) {
+        return blk_aio_discard_impl(blk, sector_num, nb_sectors, cb, opaque);
+    } else {
+        ReplayAIOCB *acb = replay_aio_create(REPLAY_ASYNC_EVENT_BLOCK_DISCARD,
+                                             blk, cb, opaque);
+        acb->req.sector = sector_num;
+        acb->req.nb_sectors = nb_sectors;
+        replay_add_event(REPLAY_ASYNC_EVENT_BLOCK_DISCARD, acb, NULL, 0);
+
+        return &acb->common;
+    }
+}
+
+void replay_event_block_discard_run(void *opaque)
+{
+    ReplayAIOCB *acb = opaque;
+    blk_aio_discard_impl(acb->blk, acb->req.sector, acb->req.nb_sectors,
+                         acb->common.cb, acb->common.opaque);
+    blk_drain_all();
+}
+
+BlockAIOCB *replay_aio_ioctl(BlockBackend *blk, unsigned long int req, void *buf,
+                             BlockCompletionFunc *cb, void *opaque)
+{
+    if (replay_mode == REPLAY_MODE_NONE) {
+        return blk_aio_ioctl_impl(blk, req, buf, cb, opaque);
+    } else {
+        ReplayAIOCB *acb = replay_aio_create(REPLAY_ASYNC_EVENT_BLOCK_IOCTL,
+                                             blk, cb, opaque);
+        acb->req.req = req;
+        acb->req.buf = buf;
+        replay_add_event(REPLAY_ASYNC_EVENT_BLOCK_IOCTL, acb, NULL, 0);
+
+        return &acb->common;
+    }
+}
+
+void replay_event_block_ioctl_run(void *opaque)
+{
+    ReplayAIOCB *acb = opaque;
+    blk_aio_ioctl_impl(acb->blk, acb->req.req, acb->req.buf,
+                       acb->common.cb, acb->common.opaque);
+    blk_drain_all();
+}
diff --git a/replay/replay-events.c b/replay/replay-events.c
index 7fc7ed0..1551257 100644
--- a/replay/replay-events.c
+++ b/replay/replay-events.c
@@ -50,6 +50,24 @@ static void replay_run_event(Event *event)
     case REPLAY_ASYNC_EVENT_CHAR:
         replay_event_char_run(event->opaque);
         break;
+    case REPLAY_ASYNC_EVENT_BLOCK_READV:
+        replay_event_block_readv_run(event->opaque);
+        break;
+    case REPLAY_ASYNC_EVENT_BLOCK_WRITEV:
+        replay_event_block_writev_run(event->opaque);
+        break;
+    case REPLAY_ASYNC_EVENT_BLOCK_WRITE_ZEROES:
+        replay_event_block_write_zeroes_run(event->opaque);
+        break;
+    case REPLAY_ASYNC_EVENT_BLOCK_FLUSH:
+        replay_event_block_flush_run(event->opaque);
+        break;
+    case REPLAY_ASYNC_EVENT_BLOCK_DISCARD:
+        replay_event_block_discard_run(event->opaque);
+        break;
+    case REPLAY_ASYNC_EVENT_BLOCK_IOCTL:
+        replay_event_block_ioctl_run(event->opaque);
+        break;
     default:
         error_report("Replay: invalid async event ID (%d) in the queue",
                      event->event_kind);
@@ -132,6 +150,24 @@ void replay_add_event(ReplayAsyncEventKind event_kind,
     replay_mutex_unlock();
 }
 
+void replay_remove_event(ReplayAsyncEventKind event_kind,
+                         void *opaque,
+                         void *opaque2, uint64_t id)
+{
+    Event *event;
+    replay_mutex_lock();
+    QTAILQ_FOREACH(event, &events_list, events) {
+        if (event->event_kind == event_kind
+            && event->opaque == opaque
+            && event->opaque2 == opaque2
+            && event->id == id) {
+            QTAILQ_REMOVE(&events_list, event, events);
+            break;
+        }
+    }
+    replay_mutex_unlock();
+}
+
 void replay_bh_schedule_event(QEMUBH *bh)
 {
     if (replay_mode != REPLAY_MODE_NONE) {
@@ -173,6 +209,13 @@ static void replay_save_event(Event *event, int checkpoint)
     case REPLAY_ASYNC_EVENT_CHAR:
         replay_event_char_save(event->opaque);
         break;
+    case REPLAY_ASYNC_EVENT_BLOCK_READV:
+    case REPLAY_ASYNC_EVENT_BLOCK_WRITEV:
+    case REPLAY_ASYNC_EVENT_BLOCK_WRITE_ZEROES:
+    case REPLAY_ASYNC_EVENT_BLOCK_FLUSH:
+    case REPLAY_ASYNC_EVENT_BLOCK_DISCARD:
+    case REPLAY_ASYNC_EVENT_BLOCK_IOCTL:
+        break;
     default:
         error_report("Unknown ID %d of replay event", read_event_kind);
         exit(1);
@@ -231,6 +274,14 @@ static Event *replay_read_event(int checkpoint)
         event->event_kind = read_event_kind;
         event->opaque = replay_event_char_read();
         return event;
+    case REPLAY_ASYNC_EVENT_BLOCK_READV:
+    case REPLAY_ASYNC_EVENT_BLOCK_WRITEV:
+    case REPLAY_ASYNC_EVENT_BLOCK_WRITE_ZEROES:
+    case REPLAY_ASYNC_EVENT_BLOCK_FLUSH:
+    case REPLAY_ASYNC_EVENT_BLOCK_DISCARD:
+    case REPLAY_ASYNC_EVENT_BLOCK_IOCTL:
+        /* search in the queue */
+        break;
     default:
         error_report("Unknown ID %d of replay event", read_event_kind);
         exit(1);
diff --git a/replay/replay-internal.h b/replay/replay-internal.h
index 34eee5b..8f705d1 100644
--- a/replay/replay-internal.h
+++ b/replay/replay-internal.h
@@ -47,6 +47,12 @@
 enum ReplayAsyncEventKind {
     REPLAY_ASYNC_EVENT_INPUT,
     REPLAY_ASYNC_EVENT_INPUT_SYNC,
     REPLAY_ASYNC_EVENT_CHAR,
+    REPLAY_ASYNC_EVENT_BLOCK_READV,
+    REPLAY_ASYNC_EVENT_BLOCK_WRITEV,
+    REPLAY_ASYNC_EVENT_BLOCK_WRITE_ZEROES,
+    REPLAY_ASYNC_EVENT_BLOCK_FLUSH,
+    REPLAY_ASYNC_EVENT_BLOCK_DISCARD,
+    REPLAY_ASYNC_EVENT_BLOCK_IOCTL,
     REPLAY_ASYNC_COUNT
 };
 
@@ -131,6 +137,9 @@ void replay_read_events(int checkpoint);
 /*! Adds specified async event to the queue */
 void replay_add_event(ReplayAsyncEventKind event_kind, void *opaque,
                       void *opaque2, uint64_t id);
+/*! Removes specified async event from the queue */
+void replay_remove_event(ReplayAsyncEventKind event_kind,
+                         void *opaque, void *opaque2, uint64_t id);
 
 /* Input events */
 
@@ -152,4 +161,19 @@ void replay_event_char_save(void *opaque);
 /*! Reads char event from the file. */
 void *replay_event_char_read(void);
 
+/* Block devices */
+
+/*! Called to run block device readv event. */
+void replay_event_block_readv_run(void *opaque);
+/*! Called to run block device writev event. */
+void replay_event_block_writev_run(void *opaque);
+/*! Called to run block device write zeroes event. */
+void replay_event_block_write_zeroes_run(void *opaque);
+/*! Called to run block device flush event. */
+void replay_event_block_flush_run(void *opaque);
+/*! Called to run block device discard event. */
+void replay_event_block_discard_run(void *opaque);
+/*! Called to run block device ioctl event. */
+void replay_event_block_ioctl_run(void *opaque);
+
 #endif
diff --git a/stubs/replay.c b/stubs/replay.c
index 8f98790..54a5f61 100644
--- a/stubs/replay.c
+++ b/stubs/replay.c
@@ -1,6 +1,7 @@
 #include "sysemu/replay.h"
 #include <stdlib.h>
 #include "sysemu/sysemu.h"
+#include "sysemu/block-backend.h"
 
 ReplayMode replay_mode;
 
@@ -29,3 +30,48 @@ bool replay_events_enabled(void)
 void replay_finish(void)
 {
 }
+
+BlockAIOCB *replay_aio_readv(BlockBackend *blk, int64_t sector_num,
+                             QEMUIOVector *iov, int nb_sectors,
+                             BlockCompletionFunc *cb, void *opaque)
+{
+    return blk_aio_readv_impl(blk, sector_num, iov, nb_sectors, cb, opaque);
+}
+
+BlockAIOCB *replay_aio_writev(BlockBackend *blk, int64_t sector_num,
+                              QEMUIOVector *iov, int nb_sectors,
+                              BlockCompletionFunc *cb, void *opaque)
+{
+    return blk_aio_writev_impl(blk, sector_num, iov, nb_sectors, cb, opaque);
+}
+
+BlockAIOCB *replay_aio_flush(BlockBackend *blk,
+                             BlockCompletionFunc *cb, void *opaque)
+{
+    return blk_aio_flush_impl(blk, cb, opaque);
+}
+
+BlockAIOCB *replay_aio_discard(BlockBackend *blk,
+                               int64_t sector_num, int nb_sectors,
+                               BlockCompletionFunc *cb, void *opaque)
+{
+    return blk_aio_discard_impl(blk, sector_num, nb_sectors, cb, opaque);
+}
+
+BlockAIOCB *replay_aio_ioctl(BlockBackend *blk, unsigned long int req, void *buf,
+                             BlockCompletionFunc *cb, void *opaque)
+{
+    return blk_aio_ioctl_impl(blk, req, buf, cb, opaque);
+}
+
+BlockAIOCB *replay_aio_write_zeroes(BlockBackend *blk, int64_t sector_num,
+                                    int nb_sectors, BdrvRequestFlags flags,
+                                    BlockCompletionFunc *cb, void *opaque)
+{
+    return blk_aio_write_zeroes_impl(blk, sector_num, nb_sectors, flags, cb, opaque);
+}
+
+void replay_bh_schedule_event(QEMUBH *bh)
+{
+    qemu_bh_schedule(bh);
+}
This patch introduces a set of functions that implement recording
and replaying of block devices' operations. These functions form a thin
layer between blk_aio_ functions and replay subsystem.
All asynchronous block requests are added to the queue to be processed
at deterministically invoked record/replay checkpoints.
Queue is flushed at checkpoints and information about processed requests
is recorded to the log. In replay phase the queue is matched with
events read from the log. Therefore block devices requests are processed
deterministically.

Signed-off-by: Pavel Dovgalyuk <pavel.dovgaluk@ispras.ru>
---
 block/block-backend.c          |   71 +++++++++++---
 include/sysemu/block-backend.h |   16 +++
 include/sysemu/replay.h        |   27 +++++
 replay/Makefile.objs           |    1 
 replay/replay-block.c          |  202 ++++++++++++++++++++++++++++++++++++++++
 replay/replay-events.c         |   51 ++++++++++
 replay/replay-internal.h       |   24 +++++
 stubs/replay.c                 |   46 +++++++++
 8 files changed, 423 insertions(+), 15 deletions(-)
 create mode 100755 replay/replay-block.c
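For reference, the queue above is only exercised when record/replay is enabled through the icount options described in docs/replay.txt; a typical invocation looks roughly like the following (disk image name and machine options are only examples):

  qemu-system-i386 -icount shift=7,rr=record,rrfile=replay.bin \
      -drive file=disk.img -net none

Re-running the same command line with rr=replay replays the recorded execution, and the block requests are then matched against the events stored in replay.bin.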