Message ID | 20181010133357.24538.19180.stgit@pasha-VirtualBox (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | Fixing record/replay and adding reverse debugging | expand |
Am 10.10.2018 um 15:33 hat Pavel Dovgalyuk geschrieben: > In record/replay mode bdrv queue is controlled by replay mechanism. > It does not allow saving or loading the snapshots > when bdrv queue is not empty. Stopping the VM is not blocked by nonempty > queue, but flushing the queue is still impossible there, > because it may cause deadlocks in replay mode. > This patch disables bdrv_drain_all and bdrv_flush_all in > record/replay mode. > > Signed-off-by: Pavel Dovgalyuk <pavel.dovgaluk@ispras.ru> Disabling bdrv_flush_all() shouldn't make a big difference in replay mode, so I agree with that. But does the bdrv_drain_all() change mean that bdrv_drain_all_begin() can return while there are still requests in flight? This sounds like something that could break some existing code, because the meaning of the function is that if it returns success, no requests are in flight any more. What would move the request forward in the replay case once it has been created? Does this also involve some user interaction? Or may we process the next event in a drain callback of blkreplay or something? Kevin
> From: Kevin Wolf [mailto:kwolf@redhat.com] > Am 10.10.2018 um 15:33 hat Pavel Dovgalyuk geschrieben: > > In record/replay mode bdrv queue is controlled by replay mechanism. > > It does not allow saving or loading the snapshots > > when bdrv queue is not empty. Stopping the VM is not blocked by nonempty > > queue, but flushing the queue is still impossible there, > > because it may cause deadlocks in replay mode. > > This patch disables bdrv_drain_all and bdrv_flush_all in > > record/replay mode. > > > > Signed-off-by: Pavel Dovgalyuk <pavel.dovgaluk@ispras.ru> > > Disabling bdrv_flush_all() shouldn't make a big difference in replay > mode, so I agree with that. > > But does the bdrv_drain_all() change mean that bdrv_drain_all_begin() > can return while there are still requests in flight? This sounds like > something that could break some existing code, because the meaning of > the function is that if it returns success, no requests are in flight > any more. > > What would move the request forward in the replay case once it has been > created? Does this also involve some user interaction? Or may we process > the next event in a drain callback of blkreplay or something? You are right - there are some operation that can't be performed at any time during the replay, because of unfinished block requests. One of these is creating the VM snapshot. I've added a check in another patch, which prevents snapshotting while the event queue (which holds the block requests) is not empty. There could also be other things, that will be fixed when we try to use it. Moving replay forward until the queue is empty is not a good idea, because step-by-step debugging should be available and work without skipping the instructions. Pavel Dovgalyuk
Am 30.11.2018 um 08:55 hat Pavel Dovgalyuk geschrieben: > > From: Kevin Wolf [mailto:kwolf@redhat.com] > > Am 10.10.2018 um 15:33 hat Pavel Dovgalyuk geschrieben: > > > In record/replay mode bdrv queue is controlled by replay mechanism. > > > It does not allow saving or loading the snapshots > > > when bdrv queue is not empty. Stopping the VM is not blocked by nonempty > > > queue, but flushing the queue is still impossible there, > > > because it may cause deadlocks in replay mode. > > > This patch disables bdrv_drain_all and bdrv_flush_all in > > > record/replay mode. > > > > > > Signed-off-by: Pavel Dovgalyuk <pavel.dovgaluk@ispras.ru> > > > > Disabling bdrv_flush_all() shouldn't make a big difference in replay > > mode, so I agree with that. > > > > But does the bdrv_drain_all() change mean that bdrv_drain_all_begin() > > can return while there are still requests in flight? This sounds like > > something that could break some existing code, because the meaning of > > the function is that if it returns success, no requests are in flight > > any more. > > > > What would move the request forward in the replay case once it has been > > created? Does this also involve some user interaction? Or may we process > > the next event in a drain callback of blkreplay or something? > > You are right - there are some operation that can't be performed at any > time during the replay, because of unfinished block requests. > > One of these is creating the VM snapshot. I've added a check in another > patch, which prevents snapshotting while the event queue (which holds > the block requests) is not empty. > > There could also be other things, that will be fixed when we try to use it. > > Moving replay forward until the queue is empty is not a good idea, because > step-by-step debugging should be available and work without skipping the instructions. So if the idea is that in replay mode, bdrv_drain_all_begin() is only called when there are no requests in flight anyway (because callers catch this situation earler), can we make this an assertion that the block device is already quiesced? Skipping drain without knowing that there are no requests in flight feels simply wrong, and I'm almost sure it will result in bugs. On the other hand, if we know that no requests are in flight, there's no real point in skipping. So I think I still don't understand the situation where skipping a drain is the correct solution. Kevin
> From: Kevin Wolf [mailto:kwolf@redhat.com] > Am 30.11.2018 um 08:55 hat Pavel Dovgalyuk geschrieben: > > > From: Kevin Wolf [mailto:kwolf@redhat.com] > > > Am 10.10.2018 um 15:33 hat Pavel Dovgalyuk geschrieben: > > > > In record/replay mode bdrv queue is controlled by replay mechanism. > > > > It does not allow saving or loading the snapshots > > > > when bdrv queue is not empty. Stopping the VM is not blocked by nonempty > > > > queue, but flushing the queue is still impossible there, > > > > because it may cause deadlocks in replay mode. > > > > This patch disables bdrv_drain_all and bdrv_flush_all in > > > > record/replay mode. > > > > > > > > Signed-off-by: Pavel Dovgalyuk <pavel.dovgaluk@ispras.ru> > > > > > > Disabling bdrv_flush_all() shouldn't make a big difference in replay > > > mode, so I agree with that. > > > > > > But does the bdrv_drain_all() change mean that bdrv_drain_all_begin() > > > can return while there are still requests in flight? This sounds like > > > something that could break some existing code, because the meaning of > > > the function is that if it returns success, no requests are in flight > > > any more. > > > > > > What would move the request forward in the replay case once it has been > > > created? Does this also involve some user interaction? Or may we process > > > the next event in a drain callback of blkreplay or something? > > > > You are right - there are some operation that can't be performed at any > > time during the replay, because of unfinished block requests. > > > > One of these is creating the VM snapshot. I've added a check in another > > patch, which prevents snapshotting while the event queue (which holds > > the block requests) is not empty. > > > > There could also be other things, that will be fixed when we try to use it. > > > > Moving replay forward until the queue is empty is not a good idea, because > > step-by-step debugging should be available and work without skipping the instructions. > > So if the idea is that in replay mode, bdrv_drain_all_begin() is only > called when there are no requests in flight anyway (because callers > catch this situation earler), can we make this an assertion that the > block device is already quiesced? Maybe it's behavior changed in the latest version? We observed that bdrv_drain_all hangs, when there is a block request in the queue. Therefore we decided that stopping with non-empty queue is better than skipping some steps. > Skipping drain without knowing that there are no requests in flight > feels simply wrong, and I'm almost sure it will result in bugs. On the > other hand, if we know that no requests are in flight, there's no real > point in skipping. What bugs can happen? Pavel Dovgalyuk
diff --git a/block/io.c b/block/io.c index bd9d688..96fb245 100644 --- a/block/io.c +++ b/block/io.c @@ -32,6 +32,7 @@ #include "qemu/cutils.h" #include "qapi/error.h" #include "qemu/error-report.h" +#include "sysemu/replay.h" #define NOT_DONE 0x7fffffff /* used while emulated sync operation in progress */ @@ -538,6 +539,13 @@ void bdrv_drain_all_begin(void) return; } + /* bdrv queue is managed by record/replay, + waiting for finishing the I/O requests may + be infinite */ + if (replay_events_enabled()) { + return; + } + /* AIO_WAIT_WHILE() with a NULL context can only be called from the main * loop AioContext, so make sure we're in the main context. */ assert(qemu_get_current_aio_context() == qemu_get_aio_context()); @@ -566,6 +574,13 @@ void bdrv_drain_all_end(void) { BlockDriverState *bs = NULL; + /* bdrv queue is managed by record/replay, + waiting for finishing the I/O requests may + be endless */ + if (replay_events_enabled()) { + return; + } + while ((bs = bdrv_next_all_states(bs))) { AioContext *aio_context = bdrv_get_aio_context(bs); @@ -1997,6 +2012,13 @@ int bdrv_flush_all(void) BlockDriverState *bs = NULL; int result = 0; + /* bdrv queue is managed by record/replay, + creating new flush request for stopping + the VM may break the determinism */ + if (replay_events_enabled()) { + return result; + } + for (bs = bdrv_first(&it); bs; bs = bdrv_next(&it)) { AioContext *aio_context = bdrv_get_aio_context(bs); int ret; diff --git a/cpus.c b/cpus.c index 1c741bc..d60ddf1 100644 --- a/cpus.c +++ b/cpus.c @@ -1078,7 +1078,6 @@ static int do_vm_stop(RunState state, bool send_stop) } bdrv_drain_all(); - replay_disable_events(); ret = bdrv_flush_all(); return ret; @@ -2135,7 +2134,6 @@ int vm_prepare_start(void) /* We are sending this now, but the CPUs will be resumed shortly later */ qapi_event_send_resume(); - replay_enable_events(); cpu_enable_ticks(); runstate_set(RUN_STATE_RUNNING); vm_state_notify(1, RUN_STATE_RUNNING);
In record/replay mode bdrv queue is controlled by replay mechanism. It does not allow saving or loading the snapshots when bdrv queue is not empty. Stopping the VM is not blocked by nonempty queue, but flushing the queue is still impossible there, because it may cause deadlocks in replay mode. This patch disables bdrv_drain_all and bdrv_flush_all in record/replay mode. Signed-off-by: Pavel Dovgalyuk <pavel.dovgaluk@ispras.ru> --- block/io.c | 22 ++++++++++++++++++++++ cpus.c | 2 -- 2 files changed, 22 insertions(+), 2 deletions(-)