diff mbox series

[v9,19/21] replay: add BH oneshot event for block layer

Message ID 154703599022.13472.1704009113561450219.stgit@pasha-VirtualBox (mailing list archive)
State New, archived
Headers show
Series Fixing record/replay and adding reverse debugging | expand

Commit Message

Pavel Dovgalyuk Jan. 9, 2019, 12:13 p.m. UTC
Replay is capable of recording normal BH events, but sometimes
there are single use callbacks scheduled with aio_bh_schedule_oneshot
function. This patch enables recording and replaying such callbacks.
Block layer uses these events for calling the completion function.
Replaying these calls makes the execution deterministic.

Signed-off-by: Pavel Dovgalyuk <Pavel.Dovgaluk@ispras.ru>

--

v6:
 - moved stub function to the separate file for fixing linux-user build
---
 block/block-backend.c    |    5 +++--
 include/sysemu/replay.h  |    3 +++
 replay/replay-events.c   |   16 ++++++++++++++++
 replay/replay-internal.h |    1 +
 replay/replay.c          |    2 +-
 stubs/Makefile.objs      |    1 +
 stubs/replay-user.c      |    9 +++++++++
 7 files changed, 34 insertions(+), 3 deletions(-)
 create mode 100644 stubs/replay-user.c

Comments

Kevin Wolf Jan. 11, 2019, 10:49 a.m. UTC | #1
Am 09.01.2019 um 13:13 hat Pavel Dovgalyuk geschrieben:
> Replay is capable of recording normal BH events, but sometimes
> there are single use callbacks scheduled with aio_bh_schedule_oneshot
> function. This patch enables recording and replaying such callbacks.
> Block layer uses these events for calling the completion function.
> Replaying these calls makes the execution deterministic.
> 
> Signed-off-by: Pavel Dovgalyuk <Pavel.Dovgaluk@ispras.ru>

This still doesn't come even close to catching all BHs that need to be
caught. While you managed to show a few BHs that actually don't need to
be considered for recording when I asked for this in v7, most BHs in the
block layer can in some way lead to device callbacks and must therefore
be recorded.

How bad would it be to record some BHs even if recording them isn't
necessary? I'd definitely try to err on the safe side here. Having two
different sets of BH functions, you can't expect that people always use
the right one (especially if you don't even make the existing code base
consistently use the right one intially).

Kevin
Pavel Dovgalyuk Jan. 14, 2019, 11:10 a.m. UTC | #2
> From: Kevin Wolf [mailto:kwolf@redhat.com]
> Am 09.01.2019 um 13:13 hat Pavel Dovgalyuk geschrieben:
> > Replay is capable of recording normal BH events, but sometimes
> > there are single use callbacks scheduled with aio_bh_schedule_oneshot
> > function. This patch enables recording and replaying such callbacks.
> > Block layer uses these events for calling the completion function.
> > Replaying these calls makes the execution deterministic.
> >
> > Signed-off-by: Pavel Dovgalyuk <Pavel.Dovgaluk@ispras.ru>
> 
> This still doesn't come even close to catching all BHs that need to be
> caught. While you managed to show a few BHs that actually don't need to
> be considered for recording when I asked for this in v7, most BHs in the
> block layer can in some way lead to device callbacks and must therefore
> be recorded.

Let's have a brief review. I can change all the places, but how
should I make a test case to be sure, that all of them are working ok?

aio_bh_schedule_oneshot is used in:
 - blk_abort_aio_request
 - bdrv_co_yield_to_drain
 - iscsi_co_generic_cb
 - nfs_co_generic_cb
 - null_aio_common
 - nvme_process_completion
 - nvme_rw_cb
 - rbd_finish_aiocb
 - vxhs_iio_callback
 - (and couple of others not in the block layer)

We must change this call to replay_bh_schedule_oneshot_event when
the result of the BH execution affects the replayed guest state
(e.g., interrupt request is generated or memory is written)

If you think that all of these can do that, then I should change
such function calls.

> How bad would it be to record some BHs even if recording them isn't
> necessary? I'd definitely try to err on the safe side here. Having two
> different sets of BH functions, you can't expect that people always use
> the right one (especially if you don't even make the existing code base
> consistently use the right one intially).

There are two possible options:
1. Execution hangs when recording. Kind of deadlock caused by the incorrect
   management of the events. E.g., adding stopping the VM and trying to flush
   the block layer queue.
2. Execution hangs when replaying. 
   One of the events that affect the guest state is missed or generated
   at the other moment (e.g., when BH is not linked to the execution step).
   Then the guest behaves differently and the order of the events in the log
   does not match the guest state (e.g., interrupt processing is not matched).

Pavel Dovgalyuk
Kevin Wolf Jan. 14, 2019, 11:35 a.m. UTC | #3
Am 14.01.2019 um 12:10 hat Pavel Dovgalyuk geschrieben:
> > From: Kevin Wolf [mailto:kwolf@redhat.com]
> > Am 09.01.2019 um 13:13 hat Pavel Dovgalyuk geschrieben:
> > > Replay is capable of recording normal BH events, but sometimes
> > > there are single use callbacks scheduled with aio_bh_schedule_oneshot
> > > function. This patch enables recording and replaying such callbacks.
> > > Block layer uses these events for calling the completion function.
> > > Replaying these calls makes the execution deterministic.
> > >
> > > Signed-off-by: Pavel Dovgalyuk <Pavel.Dovgaluk@ispras.ru>
> > 
> > This still doesn't come even close to catching all BHs that need to be
> > caught. While you managed to show a few BHs that actually don't need to
> > be considered for recording when I asked for this in v7, most BHs in the
> > block layer can in some way lead to device callbacks and must therefore
> > be recorded.
> 
> Let's have a brief review. I can change all the places, but how
> should I make a test case to be sure, that all of them are working ok?

The list is changing all the time. This is why I am so concerned about
special-casing a few callers instead of having a generic solution. I
don't know how we could make sure that we call the right function
everywhere.

> aio_bh_schedule_oneshot is used in:
>  - blk_abort_aio_request
>  - bdrv_co_yield_to_drain
>  - iscsi_co_generic_cb
>  - nfs_co_generic_cb
>  - null_aio_common
>  - nvme_process_completion
>  - nvme_rw_cb
>  - rbd_finish_aiocb
>  - vxhs_iio_callback
>  - (and couple of others not in the block layer)

In addition to these, we have at least a few functions that just resume
block layer coroutines rather than directly scheduling a BH with a
callback somewhere in the block layer.

> We must change this call to replay_bh_schedule_oneshot_event when
> the result of the BH execution affects the replayed guest state
> (e.g., interrupt request is generated or memory is written)
> 
> If you think that all of these can do that, then I should change
> such function calls.

I haven't reviewed the code, but these names look like all of them can
eventually call back into the guest devices. They won't do that always,
but potentially.

> > How bad would it be to record some BHs even if recording them isn't
> > necessary? I'd definitely try to err on the safe side here. Having two
> > different sets of BH functions, you can't expect that people always use
> > the right one (especially if you don't even make the existing code base
> > consistently use the right one intially).
> 
> There are two possible options:
> 1. Execution hangs when recording. Kind of deadlock caused by the incorrect
>    management of the events. E.g., adding stopping the VM and trying to flush
>    the block layer queue.
> 2. Execution hangs when replaying. 
>    One of the events that affect the guest state is missed or generated
>    at the other moment (e.g., when BH is not linked to the execution step).
>    Then the guest behaves differently and the order of the events in the log
>    does not match the guest state (e.g., interrupt processing is not matched).

So basically when you have two events that are kind of nested? Operation
A triggers event A, but in order to complete the operation, you call
operation B with event B internally, which isn't available yet because
we're still handling event A?

Could this be solved by not having an order of events, but an order of
sets of events?

Kevin
Pavel Dovgalyuk Jan. 14, 2019, 11:48 a.m. UTC | #4
> From: Kevin Wolf [mailto:kwolf@redhat.com]
> Am 14.01.2019 um 12:10 hat Pavel Dovgalyuk geschrieben:
> > > From: Kevin Wolf [mailto:kwolf@redhat.com]
> > > Am 09.01.2019 um 13:13 hat Pavel Dovgalyuk geschrieben:
> > > > Replay is capable of recording normal BH events, but sometimes
> > > > there are single use callbacks scheduled with aio_bh_schedule_oneshot
> > > > function. This patch enables recording and replaying such callbacks.
> > > > Block layer uses these events for calling the completion function.
> > > > Replaying these calls makes the execution deterministic.
> > > >
> > > > Signed-off-by: Pavel Dovgalyuk <Pavel.Dovgaluk@ispras.ru>
> > >
> > > This still doesn't come even close to catching all BHs that need to be
> > > caught. While you managed to show a few BHs that actually don't need to
> > > be considered for recording when I asked for this in v7, most BHs in the
> > > block layer can in some way lead to device callbacks and must therefore
> > > be recorded.
> >
> > Let's have a brief review. I can change all the places, but how
> > should I make a test case to be sure, that all of them are working ok?
> 
> The list is changing all the time. This is why I am so concerned about
> special-casing a few callers instead of having a generic solution. I
> don't know how we could make sure that we call the right function
> everywhere.
> 
> > aio_bh_schedule_oneshot is used in:
> >  - blk_abort_aio_request
> >  - bdrv_co_yield_to_drain
> >  - iscsi_co_generic_cb
> >  - nfs_co_generic_cb
> >  - null_aio_common
> >  - nvme_process_completion
> >  - nvme_rw_cb
> >  - rbd_finish_aiocb
> >  - vxhs_iio_callback
> >  - (and couple of others not in the block layer)
> 
> In addition to these, we have at least a few functions that just resume
> block layer coroutines rather than directly scheduling a BH with a
> callback somewhere in the block layer.
> 
> > We must change this call to replay_bh_schedule_oneshot_event when
> > the result of the BH execution affects the replayed guest state
> > (e.g., interrupt request is generated or memory is written)
> >
> > If you think that all of these can do that, then I should change
> > such function calls.
> 
> I haven't reviewed the code, but these names look like all of them can
> eventually call back into the guest devices. They won't do that always,
> but potentially.

Ok, then I'll try making all of them deterministic.

> > > How bad would it be to record some BHs even if recording them isn't
> > > necessary? I'd definitely try to err on the safe side here. Having two
> > > different sets of BH functions, you can't expect that people always use
> > > the right one (especially if you don't even make the existing code base
> > > consistently use the right one intially).
> >
> > There are two possible options:
> > 1. Execution hangs when recording. Kind of deadlock caused by the incorrect
> >    management of the events. E.g., adding stopping the VM and trying to flush
> >    the block layer queue.
> > 2. Execution hangs when replaying.
> >    One of the events that affect the guest state is missed or generated
> >    at the other moment (e.g., when BH is not linked to the execution step).
> >    Then the guest behaves differently and the order of the events in the log
> >    does not match the guest state (e.g., interrupt processing is not matched).
> 
> So basically when you have two events that are kind of nested? 

Can you give a concrete example?

> Operation
> A triggers event A, but in order to complete the operation, you call
> operation B with event B internally, which isn't available yet because
> we're still handling event A?
> 
> Could this be solved by not having an order of events, but an order of
> sets of events?

There are two kind of operations: synchronous and not.
E.g., querying the disk contents through the PIO waits until the disk operation
is finished, and does not need other synchronization. DMA operations are processed
by coroutines and bottom halves. Synchronization of these objects should be
recorded, because the moment of invocation may be different, and it affects
the guest state.

Pavel Dovgalyuk
Pavel Dovgalyuk Feb. 13, 2019, 5:47 a.m. UTC | #5
> From: Kevin Wolf [mailto:kwolf@redhat.com]
> Am 14.01.2019 um 12:10 hat Pavel Dovgalyuk geschrieben:
> > > From: Kevin Wolf [mailto:kwolf@redhat.com]
> > > Am 09.01.2019 um 13:13 hat Pavel Dovgalyuk geschrieben:
> > > > Replay is capable of recording normal BH events, but sometimes
> > > > there are single use callbacks scheduled with aio_bh_schedule_oneshot
> > > > function. This patch enables recording and replaying such callbacks.
> > > > Block layer uses these events for calling the completion function.
> > > > Replaying these calls makes the execution deterministic.
> > > >
> > > > Signed-off-by: Pavel Dovgalyuk <Pavel.Dovgaluk@ispras.ru>
> > >
> > > This still doesn't come even close to catching all BHs that need to be
> > > caught. While you managed to show a few BHs that actually don't need to
> > > be considered for recording when I asked for this in v7, most BHs in the
> > > block layer can in some way lead to device callbacks and must therefore
> > > be recorded.
> >
> > Let's have a brief review. I can change all the places, but how
> > should I make a test case to be sure, that all of them are working ok?
> 
> The list is changing all the time. This is why I am so concerned about
> special-casing a few callers instead of having a generic solution. I
> don't know how we could make sure that we call the right function
> everywhere.

I changed all oneshot invocations in the block layer in the new version.
Can you review it and other block-related patches?

Pavel Dovgalyuk
diff mbox series

Patch

diff --git a/block/block-backend.c b/block/block-backend.c
index 60d37a0..41ee871 100644
--- a/block/block-backend.c
+++ b/block/block-backend.c
@@ -17,6 +17,7 @@ 
 #include "block/throttle-groups.h"
 #include "sysemu/blockdev.h"
 #include "sysemu/sysemu.h"
+#include "sysemu/replay.h"
 #include "qapi/error.h"
 #include "qapi/qapi-events-block.h"
 #include "qemu/id.h"
@@ -1380,8 +1381,8 @@  static BlockAIOCB *blk_aio_prwv(BlockBackend *blk, int64_t offset, int bytes,
 
     acb->has_returned = true;
     if (acb->rwco.ret != NOT_DONE) {
-        aio_bh_schedule_oneshot(blk_get_aio_context(blk),
-                                blk_aio_complete_bh, acb);
+        replay_bh_schedule_oneshot_event(blk_get_aio_context(blk),
+                                         blk_aio_complete_bh, acb);
     }
 
     return &acb->common;
diff --git a/include/sysemu/replay.h b/include/sysemu/replay.h
index 5baf10a..eaad320 100644
--- a/include/sysemu/replay.h
+++ b/include/sysemu/replay.h
@@ -164,6 +164,9 @@  bool replay_events_enabled(void);
 void replay_flush_events(void);
 /*! Adds bottom half event to the queue */
 void replay_bh_schedule_event(QEMUBH *bh);
+/* Adds oneshot bottom half event to the queue */
+void replay_bh_schedule_oneshot_event(AioContext *ctx,
+    QEMUBHFunc *cb, void *opaque);
 /*! Adds input event to the queue */
 void replay_input_event(QemuConsole *src, InputEvent *evt);
 /*! Adds input sync event to the queue */
diff --git a/replay/replay-events.c b/replay/replay-events.c
index d9a2d49..fc9b082 100644
--- a/replay/replay-events.c
+++ b/replay/replay-events.c
@@ -37,6 +37,9 @@  static void replay_run_event(Event *event)
     case REPLAY_ASYNC_EVENT_BH:
         aio_bh_call(event->opaque);
         break;
+    case REPLAY_ASYNC_EVENT_BH_ONESHOT:
+        ((QEMUBHFunc *)event->opaque)(event->opaque2);
+        break;
     case REPLAY_ASYNC_EVENT_INPUT:
         qemu_input_event_send_impl(NULL, (InputEvent *)event->opaque);
         qapi_free_InputEvent((InputEvent *)event->opaque);
@@ -132,6 +135,17 @@  void replay_bh_schedule_event(QEMUBH *bh)
     }
 }
 
+void replay_bh_schedule_oneshot_event(AioContext *ctx,
+    QEMUBHFunc *cb, void *opaque)
+{
+    if (events_enabled) {
+        uint64_t id = replay_get_current_step();
+        replay_add_event(REPLAY_ASYNC_EVENT_BH_ONESHOT, cb, opaque, id);
+    } else {
+        aio_bh_schedule_oneshot(ctx, cb, opaque);
+    }
+}
+
 void replay_add_input_event(struct InputEvent *event)
 {
     replay_add_event(REPLAY_ASYNC_EVENT_INPUT, event, NULL, 0);
@@ -162,6 +176,7 @@  static void replay_save_event(Event *event, int checkpoint)
         /* save event-specific data */
         switch (event->event_kind) {
         case REPLAY_ASYNC_EVENT_BH:
+        case REPLAY_ASYNC_EVENT_BH_ONESHOT:
             replay_put_qword(event->id);
             break;
         case REPLAY_ASYNC_EVENT_INPUT:
@@ -217,6 +232,7 @@  static Event *replay_read_event(int checkpoint)
     /* Events that has not to be in the queue */
     switch (replay_state.read_event_kind) {
     case REPLAY_ASYNC_EVENT_BH:
+    case REPLAY_ASYNC_EVENT_BH_ONESHOT:
         if (replay_state.read_event_id == -1) {
             replay_state.read_event_id = replay_get_qword();
         }
diff --git a/replay/replay-internal.h b/replay/replay-internal.h
index e37b201..af2b941 100644
--- a/replay/replay-internal.h
+++ b/replay/replay-internal.h
@@ -51,6 +51,7 @@  enum ReplayEvents {
 
 enum ReplayAsyncEventKind {
     REPLAY_ASYNC_EVENT_BH,
+    REPLAY_ASYNC_EVENT_BH_ONESHOT,
     REPLAY_ASYNC_EVENT_INPUT,
     REPLAY_ASYNC_EVENT_INPUT_SYNC,
     REPLAY_ASYNC_EVENT_CHAR_READ,
diff --git a/replay/replay.c b/replay/replay.c
index 3996499..fdf1778 100644
--- a/replay/replay.c
+++ b/replay/replay.c
@@ -22,7 +22,7 @@ 
 
 /* Current version of the replay mechanism.
    Increase it when file format changes. */
-#define REPLAY_VERSION              0xe02007
+#define REPLAY_VERSION              0xe02008
 /* Size of replay log header */
 #define HEADER_SIZE                 (sizeof(uint32_t) + sizeof(uint64_t))
 
diff --git a/stubs/Makefile.objs b/stubs/Makefile.objs
index 5dd0aee..8e8df1e 100644
--- a/stubs/Makefile.objs
+++ b/stubs/Makefile.objs
@@ -24,6 +24,7 @@  stub-obj-y += monitor.o
 stub-obj-y += notify-event.o
 stub-obj-y += qtest.o
 stub-obj-y += replay.o
+stub-obj-y += replay-user.o
 stub-obj-y += runstate-check.o
 stub-obj-y += set-fd-handler.o
 stub-obj-y += slirp.o
diff --git a/stubs/replay-user.c b/stubs/replay-user.c
new file mode 100644
index 0000000..2ad9e27
--- /dev/null
+++ b/stubs/replay-user.c
@@ -0,0 +1,9 @@ 
+#include "qemu/osdep.h"
+#include "sysemu/replay.h"
+#include "sysemu/sysemu.h"
+
+void replay_bh_schedule_oneshot_event(AioContext *ctx,
+    QEMUBHFunc *cb, void *opaque)
+{
+    aio_bh_schedule_oneshot(ctx, cb, opaque);
+}