[v3,03/10] migration: Refactor error handling in source return path

Message ID 20231004220240.167175-4-peterx@redhat.com (mailing list archive)
State New, archived
Series migration: Better error handling in rp thread, allow failures in recover

Commit Message

Peter Xu Oct. 4, 2023, 10:02 p.m. UTC
rp_state.error was a boolean used to show that an error happened in the
return path thread.  That not only duplicates error reporting
(migrate_set_error), but is also not good enough: we only do error_report()
and set it to true, so we can never keep a history of the exact error and
show it in query-migrate.

To make this better, a few things are done:

  - Use error_setg() rather than error_report() across the whole lifecycle
    of return path thread, keeping the error in an Error*.

  - Use migrate_set_error() to apply that captured error to the global
    migration object when an error occurred in this thread.

  - With the above, there is no need for mark_source_rp_bad(); remove it,
    along with rp_state.error itself.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 migration/migration.h  |   1 -
 migration/ram.h        |   5 +-
 migration/migration.c  | 123 ++++++++++++++++++-----------------------
 migration/ram.c        |  41 +++++++-------
 migration/trace-events |   4 +-
 5 files changed, 79 insertions(+), 95 deletions(-)

Comments

Philippe Mathieu-Daudé Oct. 5, 2023, 6:11 a.m. UTC | #1
Hi Peter,

On 5/10/23 00:02, Peter Xu wrote:
> rp_state.error was a boolean used to show error happened in return path
> thread.  That's not only duplicating error reporting (migrate_set_error),
> but also not good enough in that we only do error_report() and set it to
> true, we never can keep a history of the exact error and show it in
> query-migrate.
> 
> To make this better, a few things done:
> 
>    - Use error_setg() rather than error_report() across the whole lifecycle
>      of return path thread, keeping the error in an Error*.
> 
>    - Use migrate_set_error() to apply that captured error to the global
>      migration object when error occurred in this thread.
> 
>    - With above, no need to have mark_source_rp_bad(), remove it, alongside
>      with rp_state.error itself.
> 
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>   migration/migration.h  |   1 -
>   migration/ram.h        |   5 +-
>   migration/migration.c  | 123 ++++++++++++++++++-----------------------
>   migration/ram.c        |  41 +++++++-------
>   migration/trace-events |   4 +-
>   5 files changed, 79 insertions(+), 95 deletions(-)


> -int ram_dirty_bitmap_reload(MigrationState *s, RAMBlock *block)
> +int ram_dirty_bitmap_reload(MigrationState *s, RAMBlock *block, Error **errp)
>   {
>       int ret = -EINVAL;
>       /* from_dst_file is always valid because we're within rp_thread */


> @@ -4193,16 +4194,16 @@ int ram_dirty_bitmap_reload(MigrationState *s, RAMBlock *block)
>   
>       ret = qemu_file_get_error(file);
>       if (ret || size != local_size) {
> -        error_report("%s: read bitmap failed for ramblock '%s': %d"
> -                     " (size 0x%"PRIx64", got: 0x%"PRIx64")",
> -                     __func__, block->idstr, ret, local_size, size);
> +        error_setg(errp, "read bitmap failed for ramblock '%s': %d"
> +                   " (size 0x%"PRIx64", got: 0x%"PRIx64")",
> +                   block->idstr, ret, local_size, size);
>           ret = -EIO;
>           goto out;
>       }
>   
>       if (end_mark != RAMBLOCK_RECV_BITMAP_ENDING) {
> -        error_report("%s: ramblock '%s' end mark incorrect: 0x%"PRIx64,
> -                     __func__, block->idstr, end_mark);
> +        error_setg(errp, "ramblock '%s' end mark incorrect: 0x%"PRIx64,
> +                   block->idstr, end_mark);
>           ret = -EINVAL;
>           goto out;
>       }

This function returns -EIO/-EINVAL errors, propagated to its 2 callers
  - migrate_handle_rp_recv_bitmap()
  - migrate_handle_rp_resume_ack()
which are only used in source_return_path_thread() where the return
value is only checked as boolean.

Could we simplify them returning a boolean (which is the pattern with
functions taking an Error** as last parameter)?
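
To illustrate the pattern being suggested, here is a minimal standalone
sketch. It does not use QEMU's real Error API: SketchError,
sketch_error_set() and SKETCH_RESUME_ACK_VALUE are hypothetical stand-ins
invented for this example, and only the shape (bool return, details in
errp) is the point.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical stand-in for QEMU's Error object; not the real API. */
typedef struct SketchError {
    char msg[128];
} SketchError;

/* In the spirit of error_setg(): record a message if the caller wants one. */
static void sketch_error_set(SketchError **errp, const char *msg)
{
    if (errp && !*errp) {
        *errp = calloc(1, sizeof(**errp));
        if (*errp) {
            snprintf((*errp)->msg, sizeof((*errp)->msg), "%s", msg);
        }
    }
}

#define SKETCH_RESUME_ACK_VALUE 1u /* assumed value, illustration only */

/* Returns true on success; on failure returns false and fills *errp. */
static bool sketch_handle_resume_ack(uint32_t value, SketchError **errp)
{
    if (value != SKETCH_RESUME_ACK_VALUE) {
        sketch_error_set(errp, "illegal resume_ack value");
        return false;
    }
    return true;
}
```

A caller would then write "if (!sketch_handle_resume_ack(v, &err)) goto out;"
and look at err only when it needs the details, so no errno value has to be
carried in the return code.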

Regards,

Phil.
Juan Quintela Oct. 5, 2023, 8:22 a.m. UTC | #2
Peter Xu <peterx@redhat.com> wrote:
> rp_state.error was a boolean used to show error happened in return path
> thread.  That's not only duplicating error reporting (migrate_set_error),
> but also not good enough in that we only do error_report() and set it to
> true, we never can keep a history of the exact error and show it in
> query-migrate.
>
> To make this better, a few things done:
>
>   - Use error_setg() rather than error_report() across the whole lifecycle
>     of return path thread, keeping the error in an Error*.

Good.

>   - Use migrate_set_error() to apply that captured error to the global
>     migration object when error occurred in this thread.

Good.

>   - With above, no need to have mark_source_rp_bad(), remove it, alongside
>     with rp_state.error itself.

Good.

>  uint64_t ram_pagesize_summary(void);
> -int ram_save_queue_pages(const char *rbname, ram_addr_t start, ram_addr_t len);
> +int ram_save_queue_pages(const char *rbname, ram_addr_t start, ram_addr_t len,
> +                         Error **errp);


good.

> @@ -1793,37 +1782,36 @@ static void migrate_handle_rp_req_pages(MigrationState *ms, const char* rbname,
>       */
>      if (!QEMU_IS_ALIGNED(start, our_host_ps) ||
>          !QEMU_IS_ALIGNED(len, our_host_ps)) {
> -        error_report("%s: Misaligned page request, start: " RAM_ADDR_FMT
> -                     " len: %zd", __func__, start, len);
> -        mark_source_rp_bad(ms);
> +        error_setg(errp, "MIG_RP_MSG_REQ_PAGES: Misaligned page request, start:"
> +                   RAM_ADDR_FMT " len: %zd", start, len);
>          return;
>      }
>  
> -    if (ram_save_queue_pages(rbname, start, len)) {
> -        mark_source_rp_bad(ms);
> -    }
> +    ram_save_queue_pages(rbname, start, len, errp);

ram_save_queue_pages() returns an int.
I think this function (migrate_handle_rp_req_pages()) should return an
int too.

Next is independent of this patch:

> -static int migrate_handle_rp_recv_bitmap(MigrationState *s, char *block_name)
> +static int migrate_handle_rp_recv_bitmap(MigrationState *s, char *block_name,
> +                                         Error **errp)
>  {
>      RAMBlock *block = qemu_ram_block_by_name(block_name);
>  
>      if (!block) {
> -        error_report("%s: invalid block name '%s'", __func__, block_name);
> +        error_setg(errp, "MIG_RP_MSG_RECV_BITMAP has invalid block name '%s'",
> +                   block_name);
>          return -EINVAL;

We return -EINVAL here.

>      }
>  
>      /* Fetch the received bitmap and refresh the dirty bitmap */
> -    return ram_dirty_bitmap_reload(s, block);
> +    return ram_dirty_bitmap_reload(s, block, errp);
>  }
>  
> -static int migrate_handle_rp_resume_ack(MigrationState *s, uint32_t value)
> +static int migrate_handle_rp_resume_ack(MigrationState *s,
> +                                        uint32_t value, Error **errp)
>  {
>      trace_source_return_path_thread_resume_ack(value);
>  
>      if (value != MIGRATION_RESUME_ACK_VALUE) {
> -        error_report("%s: illegal resume_ack value %"PRIu32,
> -                     __func__, value);
> +        error_setg(errp, "illegal resume_ack value %"PRIu32, value);
>          return -1;

And here -1.

In both callers we just check whether it is different from zero.  We never
use the return value as an errno, so I think we should move to -1; if there
is an error, the details are what errp is for.


> -/* Returns 0 if the RP was ok, otherwise there was an error on the RP */
> -static int await_return_path_close_on_source(MigrationState *ms)
> +static void await_return_path_close_on_source(MigrationState *ms)
>  {
> -    int ret;
> -
>      if (!ms->rp_state.rp_thread_created) {
> -        return 0;
> +        return;
>      }
>  
>      trace_migration_return_path_end_before();
> @@ -2060,18 +2050,10 @@ static int await_return_path_close_on_source(MigrationState *ms)
>          }
>      }
>  
> -    trace_await_return_path_close_on_source_joining();
>      qemu_thread_join(&ms->rp_state.rp_thread);
>      ms->rp_state.rp_thread_created = false;
> -    trace_await_return_path_close_on_source_close();
> -
> -    ret = ms->rp_state.error;
> -    ms->rp_state.error = false;
> -
>      migration_release_dst_files(ms);
> -
> -    trace_migration_return_path_end_after(ret);
> -    return ret;
> +    trace_migration_return_path_end_after();
>  }
>  
>  static inline void
> @@ -2367,7 +2349,10 @@ static void migration_completion(MigrationState *s)
>          goto fail;
>      }
>  
> -    if (await_return_path_close_on_source(s)) {
> +    await_return_path_close_on_source(s);
> +
> +    /* If return path has error, should have been set here */
> +    if (migrate_has_error(s)) {
>          goto fail;
>      }

In general, I think this is bad.  We are moving from

int foo(..)
{

}

....

if (foo()) {
     goto fail;
}

to:

void foo(..)
{

}

....

foo();

if (bar()) {
     goto fail;
}

I would prefer to move the other way around: handle the error
synchronously. My plan is that at some point in time
qemu_file_get_error() disappears, i.e. we return the error when we
receive it and we handle it synchronously.

And yes, that is something that will take a lot of time, but I hope we
move in that direction, rather than relying more on setting internal
errors, using void functions and then checking with yet another function.


On top of your changes:

> -int ram_dirty_bitmap_reload(MigrationState *s, RAMBlock *block)
> +int ram_dirty_bitmap_reload(MigrationState *s, RAMBlock *block, Error **errp)
>  {
>      int ret = -EINVAL;
>      /* from_dst_file is always valid because we're within rp_thread */
> @@ -4163,8 +4165,8 @@ int ram_dirty_bitmap_reload(MigrationState *s, RAMBlock *block)
>      trace_ram_dirty_bitmap_reload_begin(block->idstr);
>  
>      if (s->state != MIGRATION_STATUS_POSTCOPY_RECOVER) {
> -        error_report("%s: incorrect state %s", __func__,
> -                     MigrationStatus_str(s->state));
> +        error_setg(errp, "Reload bitmap in incorrect state %s",
> +                   MigrationStatus_str(s->state));
>          return -EINVAL;

return -1

same for the rest of the cases. Callers only check for != 0, and if you
want details, you need to look at errp.

See the nice series for migration/rdma.c for why this is better (and
more consistent).

Rest of the patch is very nice.

Thanks, Juan.
Fabiano Rosas Oct. 5, 2023, 12:57 p.m. UTC | #3
Peter Xu <peterx@redhat.com> writes:

> @@ -1882,48 +1870,46 @@ static void *source_return_path_thread(void *opaque)
>      uint32_t tmp32, sibling_error;
>      ram_addr_t start = 0; /* =0 to silence warning */
>      size_t  len = 0, expected_len;
> +    Error *err = NULL;
>      int res;
>  
>      trace_source_return_path_thread_entry();
>      rcu_register_thread();
>  
> -    while (!ms->rp_state.error && !qemu_file_get_error(rp) &&
> +    while (!migrate_has_error(ms) && !qemu_file_get_error(rp) &&
>             migration_is_setup_or_active(ms->state)) {
>          trace_source_return_path_thread_loop_top();
> +
>          header_type = qemu_get_be16(rp);
>          header_len = qemu_get_be16(rp);
>  
>          if (qemu_file_get_error(rp)) {
> -            mark_source_rp_bad(ms);
>              goto out;
>          }

This error will be lost because outside the loop we only check for err.

>  
>          if (header_type >= MIG_RP_MSG_MAX ||
>              header_type == MIG_RP_MSG_INVALID) {
> -            error_report("RP: Received invalid message 0x%04x length 0x%04x",
> -                         header_type, header_len);
> -            mark_source_rp_bad(ms);
> +            error_setg(&err, "Received invalid message 0x%04x length 0x%04x",
> +                       header_type, header_len);
>              goto out;
>          }
>  
>          if ((rp_cmd_args[header_type].len != -1 &&
>              header_len != rp_cmd_args[header_type].len) ||
>              header_len > sizeof(buf)) {
> -            error_report("RP: Received '%s' message (0x%04x) with"
> -                         "incorrect length %d expecting %zu",
> -                         rp_cmd_args[header_type].name, header_type, header_len,
> -                         (size_t)rp_cmd_args[header_type].len);
> -            mark_source_rp_bad(ms);
> +            error_setg(&err, "Received '%s' message (0x%04x) with"
> +                       "incorrect length %d expecting %zu",
> +                       rp_cmd_args[header_type].name, header_type, header_len,
> +                       (size_t)rp_cmd_args[header_type].len);
>              goto out;
>          }
>  
>          /* We know we've got a valid header by this point */
>          res = qemu_get_buffer(rp, buf, header_len);
>          if (res != header_len) {
> -            error_report("RP: Failed reading data for message 0x%04x"
> -                         " read %d expected %d",
> -                         header_type, res, header_len);
> -            mark_source_rp_bad(ms);
> +            error_setg(&err, "Failed reading data for message 0x%04x"
> +                       " read %d expected %d",
> +                       header_type, res, header_len);
>              goto out;
>          }
>  
> @@ -1933,8 +1919,7 @@ static void *source_return_path_thread(void *opaque)
>              sibling_error = ldl_be_p(buf);
>              trace_source_return_path_thread_shut(sibling_error);
>              if (sibling_error) {
> -                error_report("RP: Sibling indicated error %d", sibling_error);
> -                mark_source_rp_bad(ms);
> +                error_setg(&err, "Sibling indicated error %d", sibling_error);
>              }
>              /*
>               * We'll let the main thread deal with closing the RP
> @@ -1952,7 +1937,10 @@ static void *source_return_path_thread(void *opaque)
>          case MIG_RP_MSG_REQ_PAGES:
>              start = ldq_be_p(buf);
>              len = ldl_be_p(buf + 8);
> -            migrate_handle_rp_req_pages(ms, NULL, start, len);
> +            migrate_handle_rp_req_pages(ms, NULL, start, len, &err);
> +            if (err) {
> +                goto out;
> +            }
>              break;
>  
>          case MIG_RP_MSG_REQ_PAGES_ID:
> @@ -1967,32 +1955,32 @@ static void *source_return_path_thread(void *opaque)
>                  expected_len += tmp32;
>              }
>              if (header_len != expected_len) {
> -                error_report("RP: Req_Page_id with length %d expecting %zd",
> -                             header_len, expected_len);
> -                mark_source_rp_bad(ms);
> +                error_setg(&err, "Req_Page_id with length %d expecting %zd",
> +                           header_len, expected_len);
> +                goto out;
> +            }
> +            migrate_handle_rp_req_pages(ms, (char *)&buf[13], start, len,
> +                                        &err);
> +            if (err) {
>                  goto out;
>              }
> -            migrate_handle_rp_req_pages(ms, (char *)&buf[13], start, len);
>              break;
>  
>          case MIG_RP_MSG_RECV_BITMAP:
>              if (header_len < 1) {
> -                error_report("%s: missing block name", __func__);
> -                mark_source_rp_bad(ms);
> +                error_setg(&err, "MIG_RP_MSG_RECV_BITMAP missing block name");
>                  goto out;
>              }
>              /* Format: len (1B) + idstr (<255B). This ends the idstr. */
>              buf[buf[0] + 1] = '\0';
> -            if (migrate_handle_rp_recv_bitmap(ms, (char *)(buf + 1))) {
> -                mark_source_rp_bad(ms);
> +            if (migrate_handle_rp_recv_bitmap(ms, (char *)(buf + 1), &err)) {
>                  goto out;
>              }
>              break;
>  
>          case MIG_RP_MSG_RESUME_ACK:
>              tmp32 = ldl_be_p(buf);
> -            if (migrate_handle_rp_resume_ack(ms, tmp32)) {
> -                mark_source_rp_bad(ms);
> +            if (migrate_handle_rp_resume_ack(ms, tmp32, &err)) {
>                  goto out;
>              }
>              break;
> @@ -2008,9 +1996,14 @@ static void *source_return_path_thread(void *opaque)
>      }
>  
>  out:
> -    if (qemu_file_get_error(rp)) {
> +    if (err) {

Need to keep both checks here.

> +        /*
> +         * Collect any error in return-path thread and report it to the
> +         * migration state object.
> +         */
> +        migrate_set_error(ms, err);
> +        error_free(err);
>          trace_source_return_path_thread_bad_end();
> -        mark_source_rp_bad(ms);
>      }
>  
>      trace_source_return_path_thread_end();
> @@ -2036,13 +2029,10 @@ static int open_return_path_on_source(MigrationState *ms)
>      return 0;
>  }
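
The two remarks above (the qemu_file_get_error() case that jumps to out
without setting err, and keeping both checks at out) can be sketched in a
standalone form. This is only an illustration: SketchState,
sketch_set_error() and sketch_rp_thread_finish() are hypothetical names,
and the "first error wins" behaviour is an assumption about how
migrate_set_error() treats an already-recorded error.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-in for the error slot in MigrationState. */
typedef struct SketchState {
    bool has_error;
    char error_msg[128];
} SketchState;

/* Assumed semantics: keep the first recorded error, ignore later ones. */
static void sketch_set_error(SketchState *s, const char *msg)
{
    if (!s->has_error) {
        s->has_error = true;
        snprintf(s->error_msg, sizeof(s->error_msg), "%s", msg);
    }
}

/*
 * Tail of a return-path loop.  err_msg is what a handler reported (NULL if
 * none); file_err is what a qemu_file_get_error()-style call returned
 * (0 if the stream is fine).  Both are checked, so a stream failure that
 * never set err_msg is still recorded.  Returns true on a clean exit.
 */
static bool sketch_rp_thread_finish(SketchState *s, const char *err_msg,
                                    int file_err)
{
    if (err_msg) {
        sketch_set_error(s, err_msg);
    } else if (file_err) {
        char buf[64];

        snprintf(buf, sizeof(buf), "return path stream error: %d", file_err);
        sketch_set_error(s, buf);
    }
    return !err_msg && !file_err;
}
```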
Peter Xu Oct. 5, 2023, 4:05 p.m. UTC | #4
On Thu, Oct 05, 2023 at 08:11:33AM +0200, Philippe Mathieu-Daudé wrote:
> Hi Peter,
> 
> On 5/10/23 00:02, Peter Xu wrote:
> > rp_state.error was a boolean used to show error happened in return path
> > thread.  That's not only duplicating error reporting (migrate_set_error),
> > but also not good enough in that we only do error_report() and set it to
> > true, we never can keep a history of the exact error and show it in
> > query-migrate.
> > 
> > To make this better, a few things done:
> > 
> >    - Use error_setg() rather than error_report() across the whole lifecycle
> >      of return path thread, keeping the error in an Error*.
> > 
> >    - Use migrate_set_error() to apply that captured error to the global
> >      migration object when error occurred in this thread.
> > 
> >    - With above, no need to have mark_source_rp_bad(), remove it, alongside
> >      with rp_state.error itself.
> > 
> > Signed-off-by: Peter Xu <peterx@redhat.com>
> > ---
> >   migration/migration.h  |   1 -
> >   migration/ram.h        |   5 +-
> >   migration/migration.c  | 123 ++++++++++++++++++-----------------------
> >   migration/ram.c        |  41 +++++++-------
> >   migration/trace-events |   4 +-
> >   5 files changed, 79 insertions(+), 95 deletions(-)
> 
> 
> > -int ram_dirty_bitmap_reload(MigrationState *s, RAMBlock *block)
> > +int ram_dirty_bitmap_reload(MigrationState *s, RAMBlock *block, Error **errp)
> >   {
> >       int ret = -EINVAL;
> >       /* from_dst_file is always valid because we're within rp_thread */
> 
> 
> > @@ -4193,16 +4194,16 @@ int ram_dirty_bitmap_reload(MigrationState *s, RAMBlock *block)
> >       ret = qemu_file_get_error(file);
> >       if (ret || size != local_size) {
> > -        error_report("%s: read bitmap failed for ramblock '%s': %d"
> > -                     " (size 0x%"PRIx64", got: 0x%"PRIx64")",
> > -                     __func__, block->idstr, ret, local_size, size);
> > +        error_setg(errp, "read bitmap failed for ramblock '%s': %d"
> > +                   " (size 0x%"PRIx64", got: 0x%"PRIx64")",
> > +                   block->idstr, ret, local_size, size);
> >           ret = -EIO;
> >           goto out;
> >       }
> >       if (end_mark != RAMBLOCK_RECV_BITMAP_ENDING) {
> > -        error_report("%s: ramblock '%s' end mark incorrect: 0x%"PRIx64,
> > -                     __func__, block->idstr, end_mark);
> > +        error_setg(errp, "ramblock '%s' end mark incorrect: 0x%"PRIx64,
> > +                   block->idstr, end_mark);
> >           ret = -EINVAL;
> >           goto out;
> >       }
> 
> This function returns -EIO/-EINVAL errors, propagated to its 2 callers
>  - migrate_handle_rp_recv_bitmap()
>  - migrate_handle_rp_resume_ack()

It is only called in migrate_handle_rp_recv_bitmap(), but I think I see
what you meant.

> which are only used in source_return_path_thread() where the return
> value is only checked as boolean.
> 
> Could we simplify them returning a boolean (which is the pattern with
> functions taking an Error** as last parameter)?

Yes, with errp passed in, the "int" retcode is slightly redundant.  I can
add one more patch on top of this as further cleanup, as below.

Thanks,

===8<===
From b1052befd72beb129012afddf5647339fe4e257c Mon Sep 17 00:00:00 2001
From: Peter Xu <peterx@redhat.com>
Date: Thu, 5 Oct 2023 12:03:44 -0400
Subject: [PATCH] migration: Change ram_dirty_bitmap_reload() retval to bool
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Now we have an Error** passed into the return path thread stack, which is
even clearer than an int retval.  Change ram_dirty_bitmap_reload() and its
callers to return a bool instead of errnos.

Suggested-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 migration/ram.h       |  2 +-
 migration/migration.c | 18 +++++++++---------
 migration/ram.c       | 24 +++++++++++-------------
 3 files changed, 21 insertions(+), 23 deletions(-)

diff --git a/migration/ram.h b/migration/ram.h
index 14ed666d58..af0290f8ab 100644
--- a/migration/ram.h
+++ b/migration/ram.h
@@ -72,7 +72,7 @@ void ramblock_recv_bitmap_set(RAMBlock *rb, void *host_addr);
 void ramblock_recv_bitmap_set_range(RAMBlock *rb, void *host_addr, size_t nr);
 int64_t ramblock_recv_bitmap_send(QEMUFile *file,
                                   const char *block_name);
-int ram_dirty_bitmap_reload(MigrationState *s, RAMBlock *rb, Error **errp);
+bool ram_dirty_bitmap_reload(MigrationState *s, RAMBlock *rb, Error **errp);
 bool ramblock_page_is_discarded(RAMBlock *rb, ram_addr_t start);
 void postcopy_preempt_shutdown_file(MigrationState *s);
 void *postcopy_preempt_thread(void *opaque);
diff --git a/migration/migration.c b/migration/migration.c
index 1a7f214fcf..e7375810be 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -1837,29 +1837,29 @@ static void migrate_handle_rp_req_pages(MigrationState *ms, const char* rbname,
     ram_save_queue_pages(rbname, start, len, errp);
 }
 
-static int migrate_handle_rp_recv_bitmap(MigrationState *s, char *block_name,
-                                         Error **errp)
+static bool migrate_handle_rp_recv_bitmap(MigrationState *s, char *block_name,
+                                          Error **errp)
 {
     RAMBlock *block = qemu_ram_block_by_name(block_name);
 
     if (!block) {
         error_setg(errp, "MIG_RP_MSG_RECV_BITMAP has invalid block name '%s'",
                    block_name);
-        return -EINVAL;
+        return false;
     }
 
     /* Fetch the received bitmap and refresh the dirty bitmap */
     return ram_dirty_bitmap_reload(s, block, errp);
 }
 
-static int migrate_handle_rp_resume_ack(MigrationState *s,
-                                        uint32_t value, Error **errp)
+static bool migrate_handle_rp_resume_ack(MigrationState *s,
+                                         uint32_t value, Error **errp)
 {
     trace_source_return_path_thread_resume_ack(value);
 
     if (value != MIGRATION_RESUME_ACK_VALUE) {
         error_setg(errp, "illegal resume_ack value %"PRIu32, value);
-        return -1;
+        return false;
     }
 
     /* Now both sides are active. */
@@ -1869,7 +1869,7 @@ static int migrate_handle_rp_resume_ack(MigrationState *s,
     /* Notify send thread that time to continue send pages */
     migration_rp_kick(s);
 
-    return 0;
+    return true;
 }
 
 /*
@@ -2021,14 +2021,14 @@ static void *source_return_path_thread(void *opaque)
             }
             /* Format: len (1B) + idstr (<255B). This ends the idstr. */
             buf[buf[0] + 1] = '\0';
-            if (migrate_handle_rp_recv_bitmap(ms, (char *)(buf + 1), &err)) {
+            if (!migrate_handle_rp_recv_bitmap(ms, (char *)(buf + 1), &err)) {
                 goto out;
             }
             break;
 
         case MIG_RP_MSG_RESUME_ACK:
             tmp32 = ldl_be_p(buf);
-            if (migrate_handle_rp_resume_ack(ms, tmp32, &err)) {
+            if (!migrate_handle_rp_resume_ack(ms, tmp32, &err)) {
                 goto out;
             }
             break;
diff --git a/migration/ram.c b/migration/ram.c
index 2565f53f5c..982fbbeee1 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -4157,23 +4157,25 @@ static int ram_dirty_bitmap_sync_all(MigrationState *s, RAMState *rs)
  * Read the received bitmap, revert it as the initial dirty bitmap.
  * This is only used when the postcopy migration is paused but wants
  * to resume from a middle point.
+ *
+ * Returns true if succeeded, false for errors.
  */
-int ram_dirty_bitmap_reload(MigrationState *s, RAMBlock *block, Error **errp)
+bool ram_dirty_bitmap_reload(MigrationState *s, RAMBlock *block, Error **errp)
 {
-    int ret = -EINVAL;
     /* from_dst_file is always valid because we're within rp_thread */
     QEMUFile *file = s->rp_state.from_dst_file;
     unsigned long *le_bitmap, nbits = block->used_length >> TARGET_PAGE_BITS;
     uint64_t local_size = DIV_ROUND_UP(nbits, 8);
     uint64_t size, end_mark;
     RAMState *rs = ram_state;
+    bool result = false;
 
     trace_ram_dirty_bitmap_reload_begin(block->idstr);
 
     if (s->state != MIGRATION_STATUS_POSTCOPY_RECOVER) {
         error_setg(errp, "Reload bitmap in incorrect state %s",
                    MigrationStatus_str(s->state));
-        return -EINVAL;
+        return false;
     }
 
     /*
@@ -4191,26 +4193,22 @@ int ram_dirty_bitmap_reload(MigrationState *s, RAMBlock *block, Error **errp)
     if (size != local_size) {
         error_setg(errp, "ramblock '%s' bitmap size mismatch (0x%"PRIx64
                    " != 0x%"PRIx64")", block->idstr, size, local_size);
-        ret = -EINVAL;
         goto out;
     }
 
     size = qemu_get_buffer(file, (uint8_t *)le_bitmap, local_size);
     end_mark = qemu_get_be64(file);
 
-    ret = qemu_file_get_error(file);
-    if (ret || size != local_size) {
-        error_setg(errp, "read bitmap failed for ramblock '%s': %d"
-                   " (size 0x%"PRIx64", got: 0x%"PRIx64")",
-                   block->idstr, ret, local_size, size);
-        ret = -EIO;
+    if (qemu_file_get_error(file) || size != local_size) {
+        error_setg(errp, "read bitmap failed for ramblock '%s': "
+                   "(size 0x%"PRIx64", got: 0x%"PRIx64")",
+                   block->idstr, local_size, size);
         goto out;
     }
 
     if (end_mark != RAMBLOCK_RECV_BITMAP_ENDING) {
         error_setg(errp, "ramblock '%s' end mark incorrect: 0x%"PRIx64,
                    block->idstr, end_mark);
-        ret = -EINVAL;
         goto out;
     }
 
@@ -4243,10 +4241,10 @@ int ram_dirty_bitmap_reload(MigrationState *s, RAMBlock *block, Error **errp)
      */
     migration_rp_kick(s);
 
-    ret = 0;
+    result = true;
 out:
     g_free(le_bitmap);
-    return ret;
+    return result;
 }
 
 static int ram_resume_prepare(MigrationState *s, void *opaque)
Peter Xu Oct. 5, 2023, 7:35 p.m. UTC | #5
On Thu, Oct 05, 2023 at 10:22:52AM +0200, Juan Quintela wrote:
> Peter Xu <peterx@redhat.com> wrote:
> > rp_state.error was a boolean used to show error happened in return path
> > thread.  That's not only duplicating error reporting (migrate_set_error),
> > but also not good enough in that we only do error_report() and set it to
> > true, we never can keep a history of the exact error and show it in
> > query-migrate.
> >
> > To make this better, a few things done:
> >
> >   - Use error_setg() rather than error_report() across the whole lifecycle
> >     of return path thread, keeping the error in an Error*.
> 
> Good.
> 
> >   - Use migrate_set_error() to apply that captured error to the global
> >     migration object when error occurred in this thread.
> 
> Good.
> 
> >   - With above, no need to have mark_source_rp_bad(), remove it, alongside
> >     with rp_state.error itself.
> 
> Good.
> 
> >  uint64_t ram_pagesize_summary(void);
> > -int ram_save_queue_pages(const char *rbname, ram_addr_t start, ram_addr_t len);
> > +int ram_save_queue_pages(const char *rbname, ram_addr_t start, ram_addr_t len,
> > +                         Error **errp);
> 
> 
> good.
> 
> > @@ -1793,37 +1782,36 @@ static void migrate_handle_rp_req_pages(MigrationState *ms, const char* rbname,
> >       */
> >      if (!QEMU_IS_ALIGNED(start, our_host_ps) ||
> >          !QEMU_IS_ALIGNED(len, our_host_ps)) {
> > -        error_report("%s: Misaligned page request, start: " RAM_ADDR_FMT
> > -                     " len: %zd", __func__, start, len);
> > -        mark_source_rp_bad(ms);
> > +        error_setg(errp, "MIG_RP_MSG_REQ_PAGES: Misaligned page request, start:"
> > +                   RAM_ADDR_FMT " len: %zd", start, len);
> >          return;
> >      }
> >  
> > -    if (ram_save_queue_pages(rbname, start, len)) {
> > -        mark_source_rp_bad(ms);
> > -    }
> > +    ram_save_queue_pages(rbname, start, len, errp);
> 
> ram_save_queue_pages() returns an int.
> I think this function should return an int.

Phil suggested something similar for the other patch.  Instead of also
letting this function return an int, I'll add one more patch to make it
return a boolean showing whether there's an error, keeping the real error
in errp.

> 
> Next is independent of this patch:
> 
> > -static int migrate_handle_rp_recv_bitmap(MigrationState *s, char *block_name)
> > +static int migrate_handle_rp_recv_bitmap(MigrationState *s, char *block_name,
> > +                                         Error **errp)
> >  {
> >      RAMBlock *block = qemu_ram_block_by_name(block_name);
> >  
> >      if (!block) {
> > -        error_report("%s: invalid block name '%s'", __func__, block_name);
> > +        error_setg(errp, "MIG_RP_MSG_RECV_BITMAP has invalid block name '%s'",
> > +                   block_name);
> >          return -EINVAL;
> 
> We sent -EINVAL.
> 
> >      }
> >  
> >      /* Fetch the received bitmap and refresh the dirty bitmap */
> > -    return ram_dirty_bitmap_reload(s, block);
> > +    return ram_dirty_bitmap_reload(s, block, errp);
> >  }
> >  
> > -static int migrate_handle_rp_resume_ack(MigrationState *s, uint32_t value)
> > +static int migrate_handle_rp_resume_ack(MigrationState *s,
> > +                                        uint32_t value, Error **errp)
> >  {
> >      trace_source_return_path_thread_resume_ack(value);
> >  
> >      if (value != MIGRATION_RESUME_ACK_VALUE) {
> > -        error_report("%s: illegal resume_ack value %"PRIu32,
> > -                     __func__, value);
> > +        error_setg(errp, "illegal resume_ack value %"PRIu32, value);
> >          return -1;
> 
> And here -1.
> 
> On both callers we just check if it is different from zero.  We never
> use the return value as errno, so I think we should move to -1, if there
> is an error, that is what errp is for.

Right.  I'll switch all return path thread functions to return a boolean,
as long as they take an errp.

> 
> 
> > -/* Returns 0 if the RP was ok, otherwise there was an error on the RP */
> > -static int await_return_path_close_on_source(MigrationState *ms)
> > +static void await_return_path_close_on_source(MigrationState *ms)
> >  {
> > -    int ret;
> > -
> >      if (!ms->rp_state.rp_thread_created) {
> > -        return 0;
> > +        return;
> >      }
> >  
> >      trace_migration_return_path_end_before();
> > @@ -2060,18 +2050,10 @@ static int await_return_path_close_on_source(MigrationState *ms)
> >          }
> >      }
> >  
> > -    trace_await_return_path_close_on_source_joining();
> >      qemu_thread_join(&ms->rp_state.rp_thread);
> >      ms->rp_state.rp_thread_created = false;
> > -    trace_await_return_path_close_on_source_close();
> > -
> > -    ret = ms->rp_state.error;
> > -    ms->rp_state.error = false;
> > -
> >      migration_release_dst_files(ms);
> > -
> > -    trace_migration_return_path_end_after(ret);
> > -    return ret;
> > +    trace_migration_return_path_end_after();
> >  }
> >  
> >  static inline void
> > @@ -2367,7 +2349,10 @@ static void migration_completion(MigrationState *s)
> >          goto fail;
> >      }
> >  
> > -    if (await_return_path_close_on_source(s)) {
> > +    await_return_path_close_on_source(s);
> > +
> > +    /* If return path has error, should have been set here */
> > +    if (migrate_has_error(s)) {
> >          goto fail;
> >      }
> 
> > In general, I think this is bad.  We are moving from
> 
> int foo(..)
> {
> 
> }
> 
> ....
> 
> if (foo()) {
>      goto fail;
> }
> 
> to:
> 
> void foo(..)
> {
> 
> }
> 
> ....
> 
> foo();
> 
> if (bar()) {
>      goto fail;
> }
> 
> > I would prefer to move the other way around.  Propagate the error
> > synchronously. My plan is that at some point in time
> > qemu_file_get_error() disappears, i.e. we return the error when we
> > receive it and we handle it synchronously.
> > 
> > And yes, that is something that will take a lot of time, but I hope we
> > move in that direction, rather than relying more on setting internal
> > errors, using void functions and then checking with yet another function.

IIUC "synchronous" here means we can have the Error* returned from
pthread_join(), but I worry that might be too late, that the real return
path Error* doesn't get its chance to set into MigrationState.error because
there can already be some error set.

I can at least move that check into await_return_path_close_on_source()
again, so it keeps returning something.  Does that sound okay for now?
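
A rough sketch of that compromise, under stated assumptions: the helper joins
the thread and then reports whether any error was recorded, so the caller can
keep the "if (!await(...)) goto fail" shape.  This is plain C with pthreads,
not QEMU code; SketchState and every sketch_* name are hypothetical, and the
"first error wins" behaviour of sketch_set_error() is an assumption made to
mirror the concern raised above.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, heavily simplified stand-in for MigrationState. */
typedef struct SketchState {
    bool rp_thread_created;
    bool inject_failure;        /* demo knob: make the thread fail */
    const char *error;          /* first recorded error, or NULL */
    pthread_mutex_t error_lock;
    pthread_t rp_thread;
} SketchState;

static void sketch_state_init(SketchState *s)
{
    s->rp_thread_created = false;
    s->inject_failure = false;
    s->error = NULL;
    pthread_mutex_init(&s->error_lock, NULL);
}

/* Record an error; an already recorded error is kept (assumption). */
static void sketch_set_error(SketchState *s, const char *msg)
{
    pthread_mutex_lock(&s->error_lock);
    if (!s->error) {
        s->error = msg;
    }
    pthread_mutex_unlock(&s->error_lock);
}

static bool sketch_has_error(SketchState *s)
{
    bool has;

    pthread_mutex_lock(&s->error_lock);
    has = s->error != NULL;
    pthread_mutex_unlock(&s->error_lock);
    return has;
}

/* Stand-in for the return path thread: reports its own error, if any. */
static void *sketch_rp_thread(void *opaque)
{
    SketchState *s = opaque;

    if (s->inject_failure) {
        sketch_set_error(s, "return path: simulated failure");
    }
    return NULL;
}

static void sketch_start_return_path(SketchState *s, bool fail)
{
    int rc;

    s->inject_failure = fail;
    rc = pthread_create(&s->rp_thread, NULL, sketch_rp_thread, s);
    assert(rc == 0);
    (void)rc;
    s->rp_thread_created = true;
}

/* Join the thread (if any), then report: true means no error recorded. */
static bool sketch_await_return_path_close(SketchState *s)
{
    if (s->rp_thread_created) {
        pthread_join(s->rp_thread, NULL);
        s->rp_thread_created = false;
    }
    return !sketch_has_error(s);
}
```

The thread still records its error itself before exiting, so nothing depends
on handing an Error* back through the join; the return value only summarises
the state after the join.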

> 
> 
> On top of your changes:
> 
> > -int ram_dirty_bitmap_reload(MigrationState *s, RAMBlock *block)
> > +int ram_dirty_bitmap_reload(MigrationState *s, RAMBlock *block, Error **errp)
> >  {
> >      int ret = -EINVAL;
> >      /* from_dst_file is always valid because we're within rp_thread */
> > @@ -4163,8 +4165,8 @@ int ram_dirty_bitmap_reload(MigrationState *s, RAMBlock *block)
> >      trace_ram_dirty_bitmap_reload_begin(block->idstr);
> >  
> >      if (s->state != MIGRATION_STATUS_POSTCOPY_RECOVER) {
> > -        error_report("%s: incorrect state %s", __func__,
> > -                     MigrationStatus_str(s->state));
> > +        error_setg(errp, "Reload bitmap in incorrect state %s",
> > +                   MigrationStatus_str(s->state));
> >          return -EINVAL;
> 
> return -1
> 
> same for the rest of the cases. Callers only check for != 0, and if you
> want details, you need to look at errp.

I'll let them return boolean here too to be consistent.

Thanks,
Peter Xu Oct. 5, 2023, 7:35 p.m. UTC | #6
On Thu, Oct 05, 2023 at 09:57:58AM -0300, Fabiano Rosas wrote:
> > @@ -2008,9 +1996,14 @@ static void *source_return_path_thread(void *opaque)
> >      }
> >  
> >  out:
> > -    if (qemu_file_get_error(rp)) {
> > +    if (err) {
> 
> Need to keep both checks here.

The next patch did that.  Let me squash that into this one.

Thanks,
Philippe Mathieu-Daudé Oct. 8, 2023, 11:39 a.m. UTC | #7
On 5/10/23 18:05, Peter Xu wrote:
> On Thu, Oct 05, 2023 at 08:11:33AM +0200, Philippe Mathieu-Daudé wrote:
>> Hi Peter,
>>
>> On 5/10/23 00:02, Peter Xu wrote:
>>> rp_state.error was a boolean used to show error happened in return path
>>> thread.  That's not only duplicating error reporting (migrate_set_error),
>>> but also not good enough in that we only do error_report() and set it to
>>> true, we never can keep a history of the exact error and show it in
>>> query-migrate.
>>>
>>> To make this better, a few things done:
>>>
>>>     - Use error_setg() rather than error_report() across the whole lifecycle
>>>       of return path thread, keeping the error in an Error*.
>>>
>>>     - Use migrate_set_error() to apply that captured error to the global
>>>       migration object when error occurred in this thread.
>>>
>>>     - With above, no need to have mark_source_rp_bad(), remove it, alongside
>>>       with rp_state.error itself.
>>>
>>> Signed-off-by: Peter Xu <peterx@redhat.com>
>>> ---
>>>    migration/migration.h  |   1 -
>>>    migration/ram.h        |   5 +-
>>>    migration/migration.c  | 123 ++++++++++++++++++-----------------------
>>>    migration/ram.c        |  41 +++++++-------
>>>    migration/trace-events |   4 +-
>>>    5 files changed, 79 insertions(+), 95 deletions(-)
>>
>>
>>> -int ram_dirty_bitmap_reload(MigrationState *s, RAMBlock *block)
>>> +int ram_dirty_bitmap_reload(MigrationState *s, RAMBlock *block, Error **errp)
>>>    {
>>>        int ret = -EINVAL;
>>>        /* from_dst_file is always valid because we're within rp_thread */
>>
>>
>>> @@ -4193,16 +4194,16 @@ int ram_dirty_bitmap_reload(MigrationState *s, RAMBlock *block)
>>>        ret = qemu_file_get_error(file);
>>>        if (ret || size != local_size) {
>>> -        error_report("%s: read bitmap failed for ramblock '%s': %d"
>>> -                     " (size 0x%"PRIx64", got: 0x%"PRIx64")",
>>> -                     __func__, block->idstr, ret, local_size, size);
>>> +        error_setg(errp, "read bitmap failed for ramblock '%s': %d"
>>> +                   " (size 0x%"PRIx64", got: 0x%"PRIx64")",
>>> +                   block->idstr, ret, local_size, size);
>>>            ret = -EIO;
>>>            goto out;
>>>        }
>>>        if (end_mark != RAMBLOCK_RECV_BITMAP_ENDING) {
>>> -        error_report("%s: ramblock '%s' end mark incorrect: 0x%"PRIx64,
>>> -                     __func__, block->idstr, end_mark);
>>> +        error_setg(errp, "ramblock '%s' end mark incorrect: 0x%"PRIx64,
>>> +                   block->idstr, end_mark);
>>>            ret = -EINVAL;
>>>            goto out;
>>>        }
>>
>> This function returns -EIO/-EINVAL errors, propagated to its 2 callers
>>   - migrate_handle_rp_recv_bitmap()
>>   - migrate_handle_rp_resume_ack()
> 
> It was only called in migrate_handle_rp_recv_bitmap(), but I think I see
> what you meant..
> 
>> which are only used in source_return_path_thread() where the return
>> value is only checked as boolean.
>>
>> Could we simplify them returning a boolean (which is the pattern with
>> functions taking an Error** as last parameter)?
> 
> Yes, with errp passed in, the "int" retcode is slightly duplicated.  I can
> add one more patch on top of this as further cleanup, as below.
> 
> Thanks,
> 
> ===8<===
>  From b1052befd72beb129012afddf5647339fe4e257c Mon Sep 17 00:00:00 2001
> From: Peter Xu <peterx@redhat.com>
> Date: Thu, 5 Oct 2023 12:03:44 -0400
> Subject: [PATCH] migration: Change ram_dirty_bitmap_reload() retval to bool
> MIME-Version: 1.0
> Content-Type: text/plain; charset=UTF-8
> Content-Transfer-Encoding: 8bit
> 
> Now we have an Error** passed into the return path thread stack, which is
> even clearer than an int retval.  Change ram_dirty_bitmap_reload() and the
> callers to use a bool instead to replace errnos.
> 
> Suggested-by: Philippe Mathieu-Daudé <philmd@linaro.org>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>   migration/ram.h       |  2 +-
>   migration/migration.c | 18 +++++++++---------
>   migration/ram.c       | 24 +++++++++++-------------
>   3 files changed, 21 insertions(+), 23 deletions(-)
> 
> diff --git a/migration/ram.h b/migration/ram.h
> index 14ed666d58..af0290f8ab 100644
> --- a/migration/ram.h
> +++ b/migration/ram.h
> @@ -72,7 +72,7 @@ void ramblock_recv_bitmap_set(RAMBlock *rb, void *host_addr);
>   void ramblock_recv_bitmap_set_range(RAMBlock *rb, void *host_addr, size_t nr);
>   int64_t ramblock_recv_bitmap_send(QEMUFile *file,
>                                     const char *block_name);
> -int ram_dirty_bitmap_reload(MigrationState *s, RAMBlock *rb, Error **errp);
> +bool ram_dirty_bitmap_reload(MigrationState *s, RAMBlock *rb, Error **errp);
>   bool ramblock_page_is_discarded(RAMBlock *rb, ram_addr_t start);
>   void postcopy_preempt_shutdown_file(MigrationState *s);
>   void *postcopy_preempt_thread(void *opaque);
> diff --git a/migration/migration.c b/migration/migration.c
> index 1a7f214fcf..e7375810be 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -1837,29 +1837,29 @@ static void migrate_handle_rp_req_pages(MigrationState *ms, const char* rbname,
>       ram_save_queue_pages(rbname, start, len, errp);
>   }
>   
> -static int migrate_handle_rp_recv_bitmap(MigrationState *s, char *block_name,
> -                                         Error **errp)
> +static bool migrate_handle_rp_recv_bitmap(MigrationState *s, char *block_name,
> +                                          Error **errp)
>   {
>       RAMBlock *block = qemu_ram_block_by_name(block_name);
>   
>       if (!block) {
>           error_setg(errp, "MIG_RP_MSG_RECV_BITMAP has invalid block name '%s'",
>                      block_name);
> -        return -EINVAL;
> +        return false;
>       }
>   
>       /* Fetch the received bitmap and refresh the dirty bitmap */
>       return ram_dirty_bitmap_reload(s, block, errp);
>   }
>   
> -static int migrate_handle_rp_resume_ack(MigrationState *s,
> -                                        uint32_t value, Error **errp)
> +static bool migrate_handle_rp_resume_ack(MigrationState *s,
> +                                         uint32_t value, Error **errp)
>   {
>       trace_source_return_path_thread_resume_ack(value);
>   
>       if (value != MIGRATION_RESUME_ACK_VALUE) {
>           error_setg(errp, "illegal resume_ack value %"PRIu32, value);
> -        return -1;
> +        return false;
>       }
>   
>       /* Now both sides are active. */
> @@ -1869,7 +1869,7 @@ static int migrate_handle_rp_resume_ack(MigrationState *s,
>       /* Notify send thread that time to continue send pages */
>       migration_rp_kick(s);
>   
> -    return 0;
> +    return true;
>   }
>   
>   /*
> @@ -2021,14 +2021,14 @@ static void *source_return_path_thread(void *opaque)
>               }
>               /* Format: len (1B) + idstr (<255B). This ends the idstr. */
>               buf[buf[0] + 1] = '\0';
> -            if (migrate_handle_rp_recv_bitmap(ms, (char *)(buf + 1), &err)) {
> +            if (!migrate_handle_rp_recv_bitmap(ms, (char *)(buf + 1), &err)) {
>                   goto out;
>               }
>               break;
>   
>           case MIG_RP_MSG_RESUME_ACK:
>               tmp32 = ldl_be_p(buf);
> -            if (migrate_handle_rp_resume_ack(ms, tmp32, &err)) {
> +            if (!migrate_handle_rp_resume_ack(ms, tmp32, &err)) {
>                   goto out;
>               }
>               break;
> diff --git a/migration/ram.c b/migration/ram.c
> index 2565f53f5c..982fbbeee1 100644
> --- a/migration/ram.c
> +++ b/migration/ram.c
> @@ -4157,23 +4157,25 @@ static int ram_dirty_bitmap_sync_all(MigrationState *s, RAMState *rs)
>    * Read the received bitmap, revert it as the initial dirty bitmap.
>    * This is only used when the postcopy migration is paused but wants
>    * to resume from a middle point.
> + *
> + * Returns true if succeeded, false for errors.
>    */
> -int ram_dirty_bitmap_reload(MigrationState *s, RAMBlock *block, Error **errp)
> +bool ram_dirty_bitmap_reload(MigrationState *s, RAMBlock *block, Error **errp)
>   {
> -    int ret = -EINVAL;
>       /* from_dst_file is always valid because we're within rp_thread */
>       QEMUFile *file = s->rp_state.from_dst_file;
>       unsigned long *le_bitmap, nbits = block->used_length >> TARGET_PAGE_BITS;
>       uint64_t local_size = DIV_ROUND_UP(nbits, 8);
>       uint64_t size, end_mark;
>       RAMState *rs = ram_state;
> +    bool result = false;
>   
>       trace_ram_dirty_bitmap_reload_begin(block->idstr);
>   
>       if (s->state != MIGRATION_STATUS_POSTCOPY_RECOVER) {
>           error_setg(errp, "Reload bitmap in incorrect state %s",
>                      MigrationStatus_str(s->state));
> -        return -EINVAL;
> +        return false;
>       }
>   
>       /*
> @@ -4191,26 +4193,22 @@ int ram_dirty_bitmap_reload(MigrationState *s, RAMBlock *block, Error **errp)
>       if (size != local_size) {
>           error_setg(errp, "ramblock '%s' bitmap size mismatch (0x%"PRIx64
>                      " != 0x%"PRIx64")", block->idstr, size, local_size);
> -        ret = -EINVAL;
>           goto out;
>       }
>   
>       size = qemu_get_buffer(file, (uint8_t *)le_bitmap, local_size);
>       end_mark = qemu_get_be64(file);
>   
> -    ret = qemu_file_get_error(file);
> -    if (ret || size != local_size) {
> -        error_setg(errp, "read bitmap failed for ramblock '%s': %d"
> -                   " (size 0x%"PRIx64", got: 0x%"PRIx64")",
> -                   block->idstr, ret, local_size, size);
> -        ret = -EIO;
> +    if (qemu_file_get_error(file) || size != local_size) {
> +        error_setg(errp, "read bitmap failed for ramblock '%s': "
> +                   "(size 0x%"PRIx64", got: 0x%"PRIx64")",
> +                   block->idstr, local_size, size);
>           goto out;
>       }
>   
>       if (end_mark != RAMBLOCK_RECV_BITMAP_ENDING) {
>           error_setg(errp, "ramblock '%s' end mark incorrect: 0x%"PRIx64,
>                      block->idstr, end_mark);
> -        ret = -EINVAL;
>           goto out;
>       }
>   
> @@ -4243,10 +4241,10 @@ int ram_dirty_bitmap_reload(MigrationState *s, RAMBlock *block, Error **errp)
>        */
>       migration_rp_kick(s);
>   
> -    ret = 0;
> +    result = true;
>   out:
>       g_free(le_bitmap);
> -    return ret;
> +    return result;
>   }
>   
>   static int ram_resume_prepare(MigrationState *s, void *opaque)

Yes, exactly what I meant. For the embedded patch:
Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>

One step further is to use g_autofree for le_bitmap to remove this
annoying 'out' label. I'll send the patch.
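
As a hedged illustration of that idea: GLib's g_autofree is built on the
GCC/Clang cleanup attribute, so the sketch below uses that attribute directly
to stay free of GLib.  SKETCH_AUTOFREE, sketch_bitmap_reload(),
SKETCH_END_MARK and the free counter are hypothetical names for this example
only; the real function reads from a QEMUFile, which is omitted here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Placeholder end marker, assumed only for this sketch. */
#define SKETCH_END_MARK 0xabcdULL

/* Counts cleanups so the demo can observe them; not needed in real code. */
static int sketch_free_calls;

/* Cleanup handler: receives a pointer to the annotated variable. */
static void sketch_autofree(void *p)
{
    void **pp = p;

    assert(pp != NULL);
    free(*pp);
    sketch_free_calls++;
}

/* Analogue of g_autofree, relying on a GCC/Clang extension. */
#define SKETCH_AUTOFREE __attribute__((cleanup(sketch_autofree)))

/* Every return path frees le_bitmap, so no 'out' label is required. */
static bool sketch_bitmap_reload(size_t size, size_t local_size,
                                 uint64_t end_mark)
{
    SKETCH_AUTOFREE unsigned long *le_bitmap =
        calloc(1, local_size ? local_size : 1);

    if (!le_bitmap) {
        return false;
    }
    if (size != local_size) {
        return false;           /* bitmap size mismatch */
    }
    if (end_mark != SKETCH_END_MARK) {
        return false;           /* end mark incorrect */
    }
    return true;
}
```

With the buffer released automatically when it goes out of scope, each error
branch can return false directly and the "result" variable disappears along
with the label.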

Patch

diff --git a/migration/migration.h b/migration/migration.h
index 4106a1dc54..33a7831da4 100644
--- a/migration/migration.h
+++ b/migration/migration.h
@@ -308,7 +308,6 @@  struct MigrationState {
         /* Protected by qemu_file_lock */
         QEMUFile     *from_dst_file;
         QemuThread    rp_thread;
-        bool          error;
         /*
          * We can also check non-zero of rp_thread, but there's no "official"
          * way to do this, so this bool makes it slightly more elegant.
diff --git a/migration/ram.h b/migration/ram.h
index 145c915ca7..14ed666d58 100644
--- a/migration/ram.h
+++ b/migration/ram.h
@@ -51,7 +51,8 @@  uint64_t ram_bytes_total(void);
 void mig_throttle_counter_reset(void);
 
 uint64_t ram_pagesize_summary(void);
-int ram_save_queue_pages(const char *rbname, ram_addr_t start, ram_addr_t len);
+int ram_save_queue_pages(const char *rbname, ram_addr_t start, ram_addr_t len,
+                         Error **errp);
 void ram_postcopy_migrated_memory_release(MigrationState *ms);
 /* For outgoing discard bitmap */
 void ram_postcopy_send_discard_bitmap(MigrationState *ms);
@@ -71,7 +72,7 @@  void ramblock_recv_bitmap_set(RAMBlock *rb, void *host_addr);
 void ramblock_recv_bitmap_set_range(RAMBlock *rb, void *host_addr, size_t nr);
 int64_t ramblock_recv_bitmap_send(QEMUFile *file,
                                   const char *block_name);
-int ram_dirty_bitmap_reload(MigrationState *s, RAMBlock *rb);
+int ram_dirty_bitmap_reload(MigrationState *s, RAMBlock *rb, Error **errp);
 bool ramblock_page_is_discarded(RAMBlock *rb, ram_addr_t start);
 void postcopy_preempt_shutdown_file(MigrationState *s);
 void *postcopy_preempt_thread(void *opaque);
diff --git a/migration/migration.c b/migration/migration.c
index 4c6de8c2dd..e821e80094 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -99,7 +99,7 @@  static int migration_maybe_pause(MigrationState *s,
                                  int *current_active_state,
                                  int new_state);
 static void migrate_fd_cancel(MigrationState *s);
-static int await_return_path_close_on_source(MigrationState *s);
+static void await_return_path_close_on_source(MigrationState *s);
 
 static bool migration_needs_multiple_sockets(void)
 {
@@ -1427,7 +1427,6 @@  int migrate_init(MigrationState *s, Error **errp)
     s->to_dst_file = NULL;
     s->state = MIGRATION_STATUS_NONE;
     s->rp_state.from_dst_file = NULL;
-    s->rp_state.error = false;
     s->mbps = 0.0;
     s->pages_per_second = 0.0;
     s->downtime = 0;
@@ -1750,16 +1749,6 @@  void qmp_migrate_continue(MigrationStatus state, Error **errp)
     qemu_sem_post(&s->pause_sem);
 }
 
-/* migration thread support */
-/*
- * Something bad happened to the RP stream, mark an error
- * The caller shall print or trace something to indicate why
- */
-static void mark_source_rp_bad(MigrationState *s)
-{
-    s->rp_state.error = true;
-}
-
 static struct rp_cmd_args {
     ssize_t     len; /* -1 = variable */
     const char *name;
@@ -1781,7 +1770,7 @@  static struct rp_cmd_args {
  * and we don't need to send pages that have already been sent.
  */
 static void migrate_handle_rp_req_pages(MigrationState *ms, const char* rbname,
-                                       ram_addr_t start, size_t len)
+                                        ram_addr_t start, size_t len, Error **errp)
 {
     long our_host_ps = qemu_real_host_page_size();
 
@@ -1793,37 +1782,36 @@  static void migrate_handle_rp_req_pages(MigrationState *ms, const char* rbname,
      */
     if (!QEMU_IS_ALIGNED(start, our_host_ps) ||
         !QEMU_IS_ALIGNED(len, our_host_ps)) {
-        error_report("%s: Misaligned page request, start: " RAM_ADDR_FMT
-                     " len: %zd", __func__, start, len);
-        mark_source_rp_bad(ms);
+        error_setg(errp, "MIG_RP_MSG_REQ_PAGES: Misaligned page request, start:"
+                   RAM_ADDR_FMT " len: %zd", start, len);
         return;
     }
 
-    if (ram_save_queue_pages(rbname, start, len)) {
-        mark_source_rp_bad(ms);
-    }
+    ram_save_queue_pages(rbname, start, len, errp);
 }
 
-static int migrate_handle_rp_recv_bitmap(MigrationState *s, char *block_name)
+static int migrate_handle_rp_recv_bitmap(MigrationState *s, char *block_name,
+                                         Error **errp)
 {
     RAMBlock *block = qemu_ram_block_by_name(block_name);
 
     if (!block) {
-        error_report("%s: invalid block name '%s'", __func__, block_name);
+        error_setg(errp, "MIG_RP_MSG_RECV_BITMAP has invalid block name '%s'",
+                   block_name);
         return -EINVAL;
     }
 
     /* Fetch the received bitmap and refresh the dirty bitmap */
-    return ram_dirty_bitmap_reload(s, block);
+    return ram_dirty_bitmap_reload(s, block, errp);
 }
 
-static int migrate_handle_rp_resume_ack(MigrationState *s, uint32_t value)
+static int migrate_handle_rp_resume_ack(MigrationState *s,
+                                        uint32_t value, Error **errp)
 {
     trace_source_return_path_thread_resume_ack(value);
 
     if (value != MIGRATION_RESUME_ACK_VALUE) {
-        error_report("%s: illegal resume_ack value %"PRIu32,
-                     __func__, value);
+        error_setg(errp, "illegal resume_ack value %"PRIu32, value);
         return -1;
     }
 
@@ -1882,48 +1870,46 @@  static void *source_return_path_thread(void *opaque)
     uint32_t tmp32, sibling_error;
     ram_addr_t start = 0; /* =0 to silence warning */
     size_t  len = 0, expected_len;
+    Error *err = NULL;
     int res;
 
     trace_source_return_path_thread_entry();
     rcu_register_thread();
 
-    while (!ms->rp_state.error && !qemu_file_get_error(rp) &&
+    while (!migrate_has_error(ms) && !qemu_file_get_error(rp) &&
            migration_is_setup_or_active(ms->state)) {
         trace_source_return_path_thread_loop_top();
+
         header_type = qemu_get_be16(rp);
         header_len = qemu_get_be16(rp);
 
         if (qemu_file_get_error(rp)) {
-            mark_source_rp_bad(ms);
             goto out;
         }
 
         if (header_type >= MIG_RP_MSG_MAX ||
             header_type == MIG_RP_MSG_INVALID) {
-            error_report("RP: Received invalid message 0x%04x length 0x%04x",
-                         header_type, header_len);
-            mark_source_rp_bad(ms);
+            error_setg(&err, "Received invalid message 0x%04x length 0x%04x",
+                       header_type, header_len);
             goto out;
         }
 
         if ((rp_cmd_args[header_type].len != -1 &&
             header_len != rp_cmd_args[header_type].len) ||
             header_len > sizeof(buf)) {
-            error_report("RP: Received '%s' message (0x%04x) with"
-                         "incorrect length %d expecting %zu",
-                         rp_cmd_args[header_type].name, header_type, header_len,
-                         (size_t)rp_cmd_args[header_type].len);
-            mark_source_rp_bad(ms);
+            error_setg(&err, "Received '%s' message (0x%04x) with"
+                       "incorrect length %d expecting %zu",
+                       rp_cmd_args[header_type].name, header_type, header_len,
+                       (size_t)rp_cmd_args[header_type].len);
             goto out;
         }
 
         /* We know we've got a valid header by this point */
         res = qemu_get_buffer(rp, buf, header_len);
         if (res != header_len) {
-            error_report("RP: Failed reading data for message 0x%04x"
-                         " read %d expected %d",
-                         header_type, res, header_len);
-            mark_source_rp_bad(ms);
+            error_setg(&err, "Failed reading data for message 0x%04x"
+                       " read %d expected %d",
+                       header_type, res, header_len);
             goto out;
         }
 
@@ -1933,8 +1919,7 @@  static void *source_return_path_thread(void *opaque)
             sibling_error = ldl_be_p(buf);
             trace_source_return_path_thread_shut(sibling_error);
             if (sibling_error) {
-                error_report("RP: Sibling indicated error %d", sibling_error);
-                mark_source_rp_bad(ms);
+                error_setg(&err, "Sibling indicated error %d", sibling_error);
             }
             /*
              * We'll let the main thread deal with closing the RP
@@ -1952,7 +1937,10 @@  static void *source_return_path_thread(void *opaque)
         case MIG_RP_MSG_REQ_PAGES:
             start = ldq_be_p(buf);
             len = ldl_be_p(buf + 8);
-            migrate_handle_rp_req_pages(ms, NULL, start, len);
+            migrate_handle_rp_req_pages(ms, NULL, start, len, &err);
+            if (err) {
+                goto out;
+            }
             break;
 
         case MIG_RP_MSG_REQ_PAGES_ID:
@@ -1967,32 +1955,32 @@  static void *source_return_path_thread(void *opaque)
                 expected_len += tmp32;
             }
             if (header_len != expected_len) {
-                error_report("RP: Req_Page_id with length %d expecting %zd",
-                             header_len, expected_len);
-                mark_source_rp_bad(ms);
+                error_setg(&err, "Req_Page_id with length %d expecting %zd",
+                           header_len, expected_len);
+                goto out;
+            }
+            migrate_handle_rp_req_pages(ms, (char *)&buf[13], start, len,
+                                        &err);
+            if (err) {
                 goto out;
             }
-            migrate_handle_rp_req_pages(ms, (char *)&buf[13], start, len);
             break;
 
         case MIG_RP_MSG_RECV_BITMAP:
             if (header_len < 1) {
-                error_report("%s: missing block name", __func__);
-                mark_source_rp_bad(ms);
+                error_setg(&err, "MIG_RP_MSG_RECV_BITMAP missing block name");
                 goto out;
             }
             /* Format: len (1B) + idstr (<255B). This ends the idstr. */
             buf[buf[0] + 1] = '\0';
-            if (migrate_handle_rp_recv_bitmap(ms, (char *)(buf + 1))) {
-                mark_source_rp_bad(ms);
+            if (migrate_handle_rp_recv_bitmap(ms, (char *)(buf + 1), &err)) {
                 goto out;
             }
             break;
 
         case MIG_RP_MSG_RESUME_ACK:
             tmp32 = ldl_be_p(buf);
-            if (migrate_handle_rp_resume_ack(ms, tmp32)) {
-                mark_source_rp_bad(ms);
+            if (migrate_handle_rp_resume_ack(ms, tmp32, &err)) {
                 goto out;
             }
             break;
@@ -2008,9 +1996,14 @@  static void *source_return_path_thread(void *opaque)
     }
 
 out:
-    if (qemu_file_get_error(rp)) {
+    if (err) {
+        /*
+         * Collect any error in return-path thread and report it to the
+         * migration state object.
+         */
+        migrate_set_error(ms, err);
+        error_free(err);
         trace_source_return_path_thread_bad_end();
-        mark_source_rp_bad(ms);
     }
 
     trace_source_return_path_thread_end();
@@ -2036,13 +2029,10 @@  static int open_return_path_on_source(MigrationState *ms)
     return 0;
 }
 
-/* Returns 0 if the RP was ok, otherwise there was an error on the RP */
-static int await_return_path_close_on_source(MigrationState *ms)
+static void await_return_path_close_on_source(MigrationState *ms)
 {
-    int ret;
-
     if (!ms->rp_state.rp_thread_created) {
-        return 0;
+        return;
     }
 
     trace_migration_return_path_end_before();
@@ -2060,18 +2050,10 @@  static int await_return_path_close_on_source(MigrationState *ms)
         }
     }
 
-    trace_await_return_path_close_on_source_joining();
     qemu_thread_join(&ms->rp_state.rp_thread);
     ms->rp_state.rp_thread_created = false;
-    trace_await_return_path_close_on_source_close();
-
-    ret = ms->rp_state.error;
-    ms->rp_state.error = false;
-
     migration_release_dst_files(ms);
-
-    trace_migration_return_path_end_after(ret);
-    return ret;
+    trace_migration_return_path_end_after();
 }
 
 static inline void
@@ -2367,7 +2349,10 @@  static void migration_completion(MigrationState *s)
         goto fail;
     }
 
-    if (await_return_path_close_on_source(s)) {
+    await_return_path_close_on_source(s);
+
+    /* If return path has error, should have been set here */
+    if (migrate_has_error(s)) {
         goto fail;
     }
 
diff --git a/migration/ram.c b/migration/ram.c
index e4bfd39f08..c54e071ea3 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -1951,7 +1951,8 @@  static void migration_page_queue_free(RAMState *rs)
  * @start: starting address from the start of the RAMBlock
  * @len: length (in bytes) to send
  */
-int ram_save_queue_pages(const char *rbname, ram_addr_t start, ram_addr_t len)
+int ram_save_queue_pages(const char *rbname, ram_addr_t start, ram_addr_t len,
+                         Error **errp)
 {
     RAMBlock *ramblock;
     RAMState *rs = ram_state;
@@ -1968,7 +1969,7 @@  int ram_save_queue_pages(const char *rbname, ram_addr_t start, ram_addr_t len)
              * Shouldn't happen, we can't reuse the last RAMBlock if
              * it's the 1st request.
              */
-            error_report("ram_save_queue_pages no previous block");
+            error_setg(errp, "MIG_RP_MSG_REQ_PAGES has no previous block");
             return -1;
         }
     } else {
@@ -1976,16 +1977,17 @@  int ram_save_queue_pages(const char *rbname, ram_addr_t start, ram_addr_t len)
 
         if (!ramblock) {
             /* We shouldn't be asked for a non-existent RAMBlock */
-            error_report("ram_save_queue_pages no block '%s'", rbname);
+            error_setg(errp, "MIG_RP_MSG_REQ_PAGES has no block '%s'", rbname);
             return -1;
         }
         rs->last_req_rb = ramblock;
     }
     trace_ram_save_queue_pages(ramblock->idstr, start, len);
     if (!offset_in_ramblock(ramblock, start + len - 1)) {
-        error_report("%s request overrun start=" RAM_ADDR_FMT " len="
-                     RAM_ADDR_FMT " blocklen=" RAM_ADDR_FMT,
-                     __func__, start, len, ramblock->used_length);
+        error_setg(errp, "MIG_RP_MSG_REQ_PAGES request overrun, "
+                   "start=" RAM_ADDR_FMT " len="
+                   RAM_ADDR_FMT " blocklen=" RAM_ADDR_FMT,
+                   start, len, ramblock->used_length);
         return -1;
     }
 
@@ -2017,9 +2019,9 @@  int ram_save_queue_pages(const char *rbname, ram_addr_t start, ram_addr_t len)
         assert(len % page_size == 0);
         while (len) {
             if (ram_save_host_page_urgent(pss)) {
-                error_report("%s: ram_save_host_page_urgent() failed: "
-                             "ramblock=%s, start_addr=0x"RAM_ADDR_FMT,
-                             __func__, ramblock->idstr, start);
+                error_setg(errp, "ram_save_host_page_urgent() failed: "
+                           "ramblock=%s, start_addr=0x"RAM_ADDR_FMT,
+                           ramblock->idstr, start);
                 ret = -1;
                 break;
             }
@@ -4151,7 +4153,7 @@  static void ram_dirty_bitmap_reload_notify(MigrationState *s)
  * This is only used when the postcopy migration is paused but wants
  * to resume from a middle point.
  */
-int ram_dirty_bitmap_reload(MigrationState *s, RAMBlock *block)
+int ram_dirty_bitmap_reload(MigrationState *s, RAMBlock *block, Error **errp)
 {
     int ret = -EINVAL;
     /* from_dst_file is always valid because we're within rp_thread */
@@ -4163,8 +4165,8 @@  int ram_dirty_bitmap_reload(MigrationState *s, RAMBlock *block)
     trace_ram_dirty_bitmap_reload_begin(block->idstr);
 
     if (s->state != MIGRATION_STATUS_POSTCOPY_RECOVER) {
-        error_report("%s: incorrect state %s", __func__,
-                     MigrationStatus_str(s->state));
+        error_setg(errp, "Reload bitmap in incorrect state %s",
+                   MigrationStatus_str(s->state));
         return -EINVAL;
     }
 
@@ -4181,9 +4183,8 @@  int ram_dirty_bitmap_reload(MigrationState *s, RAMBlock *block)
 
     /* The size of the bitmap should match with our ramblock */
     if (size != local_size) {
-        error_report("%s: ramblock '%s' bitmap size mismatch "
-                     "(0x%"PRIx64" != 0x%"PRIx64")", __func__,
-                     block->idstr, size, local_size);
+        error_setg(errp, "ramblock '%s' bitmap size mismatch (0x%"PRIx64
+                   " != 0x%"PRIx64")", block->idstr, size, local_size);
         ret = -EINVAL;
         goto out;
     }
@@ -4193,16 +4194,16 @@  int ram_dirty_bitmap_reload(MigrationState *s, RAMBlock *block)
 
     ret = qemu_file_get_error(file);
     if (ret || size != local_size) {
-        error_report("%s: read bitmap failed for ramblock '%s': %d"
-                     " (size 0x%"PRIx64", got: 0x%"PRIx64")",
-                     __func__, block->idstr, ret, local_size, size);
+        error_setg(errp, "read bitmap failed for ramblock '%s': %d"
+                   " (size 0x%"PRIx64", got: 0x%"PRIx64")",
+                   block->idstr, ret, local_size, size);
         ret = -EIO;
         goto out;
     }
 
     if (end_mark != RAMBLOCK_RECV_BITMAP_ENDING) {
-        error_report("%s: ramblock '%s' end mark incorrect: 0x%"PRIx64,
-                     __func__, block->idstr, end_mark);
+        error_setg(errp, "ramblock '%s' end mark incorrect: 0x%"PRIx64,
+                   block->idstr, end_mark);
         ret = -EINVAL;
         goto out;
     }
diff --git a/migration/trace-events b/migration/trace-events
index 002abe3a4e..5739f6b266 100644
--- a/migration/trace-events
+++ b/migration/trace-events
@@ -147,8 +147,6 @@  multifd_tls_outgoing_handshake_complete(void *ioc) "ioc=%p"
 multifd_set_outgoing_channel(void *ioc, const char *ioctype, const char *hostname, void *err)  "ioc=%p ioctype=%s hostname=%s err=%p"
 
 # migration.c
-await_return_path_close_on_source_close(void) ""
-await_return_path_close_on_source_joining(void) ""
 migrate_set_state(const char *new_state) "new state %s"
 migrate_fd_cleanup(void) ""
 migrate_fd_error(const char *error_desc) "error=%s"
@@ -165,7 +163,7 @@  migration_completion_postcopy_end_after_complete(void) ""
 migration_rate_limit_pre(int ms) "%d ms"
 migration_rate_limit_post(int urgent) "urgent: %d"
 migration_return_path_end_before(void) ""
-migration_return_path_end_after(int rp_error) "%d"
+migration_return_path_end_after(void) ""
 migration_thread_after_loop(void) ""
 migration_thread_file_err(void) ""
 migration_thread_setup_complete(void) ""