[2/2] migration/multifd: Fix rb->receivedmap cleanup race

Message ID	20240913220542.18305-3-farosas@suse.de (mailing list archive)
State	New, archived
Headers	show Return-Path: <qemu-devel-bounces+qemu-devel=archiver.kernel.org@nongnu.org> From: Fabiano Rosas <farosas@suse.de> To: qemu-devel@nongnu.org Cc: Peter Xu <peterx@redhat.com>, Peter Maydell <peter.maydell@linaro.org>, qemu-stable@nongnu.org Subject: [PATCH 2/2] migration/multifd: Fix rb->receivedmap cleanup race Date: Fri, 13 Sep 2024 19:05:42 -0300 Message-Id: <20240913220542.18305-3-farosas@suse.de> In-Reply-To: <20240913220542.18305-1-farosas@suse.de> References: <20240913220542.18305-1-farosas@suse.de> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Received-SPF: pass client-ip=2a07:de40:b251:101:10:150:64:1; envelope-from=farosas@suse.de; helo=smtp-out1.suse.de X-Spam_score_int: -20 X-Spam_score: -2.1 X-Spam_bar: -- X-Spam_report: (-2.1 / 5.0 requ) BAYES_00=-1.9, DKIM_SIGNED=0.1, DKIM_VALID=-0.1, DKIM_VALID_AU=-0.1, DKIM_VALID_EF=-0.1, SPF_HELO_NONE=0.001, SPF_PASS=-0.001 autolearn=ham autolearn_force=no X-Spam_action: no action Precedence: list Errors-To: qemu-devel-bounces+qemu-devel=archiver.kernel.org@nongnu.org Sender: qemu-devel-bounces+qemu-devel=archiver.kernel.org@nongnu.org
Series	migration/multifd: Fix rb->receivedmap cleanup race \| expand [0/2] migration/multifd: Fix rb->receivedmap cleanup race [1/2] migration/savevm: Remove extra load cleanup calls [2/2] migration/multifd: Fix rb->receivedmap cleanup race

Message ID

20240913220542.18305-3-farosas@suse.de (mailing list archive)

State

New, archived

Headers

From: Fabiano Rosas <farosas@suse.de>
To: qemu-devel@nongnu.org
Cc: Peter Xu <peterx@redhat.com>, Peter Maydell <peter.maydell@linaro.org>,
 qemu-stable@nongnu.org
Subject: [PATCH 2/2] migration/multifd: Fix rb->receivedmap cleanup race
Date: Fri, 13 Sep 2024 19:05:42 -0300
Message-Id: <20240913220542.18305-3-farosas@suse.de>
In-Reply-To: <20240913220542.18305-1-farosas@suse.de>
References: <20240913220542.18305-1-farosas@suse.de>
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
Received-SPF: pass client-ip=2a07:de40:b251:101:10:150:64:1;
 envelope-from=farosas@suse.de; helo=smtp-out1.suse.de
X-Spam_score_int: -20
X-Spam_score: -2.1
X-Spam_bar: --
X-Spam_report: (-2.1 / 5.0 requ) BAYES_00=-1.9, DKIM_SIGNED=0.1,
 DKIM_VALID=-0.1, DKIM_VALID_AU=-0.1, DKIM_VALID_EF=-0.1, SPF_HELO_NONE=0.001,
 SPF_PASS=-0.001 autolearn=ham autolearn_force=no
X-Spam_action: no action
X-BeenThere: qemu-devel@nongnu.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: <qemu-devel.nongnu.org>
List-Unsubscribe: <https://lists.nongnu.org/mailman/options/qemu-devel>,
 <mailto:qemu-devel-request@nongnu.org?subject=unsubscribe>
List-Archive: <https://lists.nongnu.org/archive/html/qemu-devel>
List-Post: <mailto:qemu-devel@nongnu.org>
List-Help: <mailto:qemu-devel-request@nongnu.org?subject=help>
List-Subscribe: <https://lists.nongnu.org/mailman/listinfo/qemu-devel>,
 <mailto:qemu-devel-request@nongnu.org?subject=subscribe>
Errors-To: qemu-devel-bounces+qemu-devel=archiver.kernel.org@nongnu.org
Sender: qemu-devel-bounces+qemu-devel=archiver.kernel.org@nongnu.org

Series

migration/multifd: Fix rb->receivedmap cleanup race | expand

Commit Message

Fabiano Rosas Sept. 13, 2024, 10:05 p.m. UTC

Fix a segmentation fault in multifd when rb->receivedmap is cleared
too early.

After commit 5ef7e26bdb ("migration/multifd: solve zero page causing
multiple page faults"), multifd started using the rb->receivedmap
bitmap, which belongs to ram.c and is initialized and *freed* from the
ram SaveVMHandlers.

Multifd threads are live until migration_incoming_state_destroy(),
which is called after qemu_loadvm_state_cleanup(), leading to a crash
when accessing rb->receivedmap.

process_incoming_migration_co()        ...
  qemu_loadvm_state()                  multifd_nocomp_recv()
    qemu_loadvm_state_cleanup()          ramblock_recv_bitmap_set_offset()
      rb->receivedmap = NULL               set_bit_atomic(..., rb->receivedmap)
  ...
  migration_incoming_state_destroy()
    multifd_recv_cleanup()
      multifd_recv_terminate_threads(NULL)

Move the loadvm cleanup into migration_incoming_state_destroy(), after
multifd_recv_cleanup() to ensure multifd thread have already exited
when rb->receivedmap is cleared.

The have_listen_thread logic can now be removed because its purpose
was to delay cleanup until postcopy_ram_listen_thread() had finished.

CC: qemu-stable@nongnu.org
Fixes: 5ef7e26bdb ("migration/multifd: solve zero page causing multiple page faults")
Signed-off-by: Fabiano Rosas <farosas@suse.de>
---
 migration/migration.c | 1 +
 migration/migration.h | 1 -
 migration/savevm.c    | 9 ---------
 3 files changed, 1 insertion(+), 10 deletions(-)

Comments

Peter Xu Sept. 17, 2024, 5:02 p.m. UTC | #1

On Fri, Sep 13, 2024 at 07:05:42PM -0300, Fabiano Rosas wrote:
> Fix a segmentation fault in multifd when rb->receivedmap is cleared
> too early.
> 
> After commit 5ef7e26bdb ("migration/multifd: solve zero page causing
> multiple page faults"), multifd started using the rb->receivedmap
> bitmap, which belongs to ram.c and is initialized and *freed* from the
> ram SaveVMHandlers.
> 
> Multifd threads are live until migration_incoming_state_destroy(),
> which is called after qemu_loadvm_state_cleanup(), leading to a crash
> when accessing rb->receivedmap.
> 
> process_incoming_migration_co()        ...
>   qemu_loadvm_state()                  multifd_nocomp_recv()
>     qemu_loadvm_state_cleanup()          ramblock_recv_bitmap_set_offset()
>       rb->receivedmap = NULL               set_bit_atomic(..., rb->receivedmap)
>   ...
>   migration_incoming_state_destroy()
>     multifd_recv_cleanup()
>       multifd_recv_terminate_threads(NULL)
> 
> Move the loadvm cleanup into migration_incoming_state_destroy(), after
> multifd_recv_cleanup() to ensure multifd thread have already exited
> when rb->receivedmap is cleared.
> 
> The have_listen_thread logic can now be removed because its purpose
> was to delay cleanup until postcopy_ram_listen_thread() had finished.
> 
> CC: qemu-stable@nongnu.org
> Fixes: 5ef7e26bdb ("migration/multifd: solve zero page causing multiple page faults")
> Signed-off-by: Fabiano Rosas <farosas@suse.de>
> ---
>  migration/migration.c | 1 +
>  migration/migration.h | 1 -
>  migration/savevm.c    | 9 ---------
>  3 files changed, 1 insertion(+), 10 deletions(-)
> 
> diff --git a/migration/migration.c b/migration/migration.c
> index 3dea06d577..b190a574b1 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -378,6 +378,7 @@ void migration_incoming_state_destroy(void)
>      struct MigrationIncomingState *mis = migration_incoming_get_current();
>  
>      multifd_recv_cleanup();
> +    qemu_loadvm_state_cleanup();
>  
>      if (mis->to_src_file) {
>          /* Tell source that we are done */
> diff --git a/migration/migration.h b/migration/migration.h
> index 38aa1402d5..20b0a5b66e 100644
> --- a/migration/migration.h
> +++ b/migration/migration.h
> @@ -101,7 +101,6 @@ struct MigrationIncomingState {
>      /* Set this when we want the fault thread to quit */
>      bool           fault_thread_quit;
>  
> -    bool           have_listen_thread;
>      QemuThread     listen_thread;
>  
>      /* For the kernel to send us notifications */
> diff --git a/migration/savevm.c b/migration/savevm.c
> index d0759694fd..532ee5e4b0 100644
> --- a/migration/savevm.c
> +++ b/migration/savevm.c
> @@ -2076,10 +2076,8 @@ static void *postcopy_ram_listen_thread(void *opaque)
>       * got a bad migration state).
>       */
>      migration_incoming_state_destroy();
> -    qemu_loadvm_state_cleanup();
>  
>      rcu_unregister_thread();
> -    mis->have_listen_thread = false;
>      postcopy_state_set(POSTCOPY_INCOMING_END);
>  
>      object_unref(OBJECT(migr));
> @@ -2130,7 +2128,6 @@ static int loadvm_postcopy_handle_listen(MigrationIncomingState *mis)
>          return -1;
>      }
>  
> -    mis->have_listen_thread = true;
>      postcopy_thread_create(mis, &mis->listen_thread, "mig/dst/listen",
>                             postcopy_ram_listen_thread, QEMU_THREAD_DETACHED);
>      trace_loadvm_postcopy_handle_listen("return");
> @@ -2978,11 +2975,6 @@ int qemu_loadvm_state(QEMUFile *f)
>  
>      trace_qemu_loadvm_state_post_main(ret);
>  
> -    if (mis->have_listen_thread) {
> -        /* Listen thread still going, can't clean up yet */
> -        return ret;
> -    }

Hmm, I wonder whether we would still need this.  IIUC it's not only about
cleanup, but also that when postcopy is involved, dst QEMU postpones doing
any of the rest in the qemu_loadvm_state_main() call.

E.g. cpu put, aka, cpu_synchronize_all_post_init(), is also done in
loadvm_postcopy_handle_run_bh() later.

IOW, I'd then expect when this patch applied we'll put cpu twice?

I think the should_send_vmdesc() part is fine, as it returns false for
postcopy anyway.  However not sure on the cpu post_init above.

> -
>      if (ret == 0) {
>          ret = qemu_file_get_error(f);
>      }
> @@ -3022,7 +3014,6 @@ int qemu_loadvm_state(QEMUFile *f)
>          }
>      }
>  
> -    qemu_loadvm_state_cleanup();
>      cpu_synchronize_all_post_init();
>  
>      return ret;
> -- 
> 2.35.3
>

Fabiano Rosas Sept. 17, 2024, 5:41 p.m. UTC | #2

Peter Xu <peterx@redhat.com> writes:

> On Fri, Sep 13, 2024 at 07:05:42PM -0300, Fabiano Rosas wrote:
>> Fix a segmentation fault in multifd when rb->receivedmap is cleared
>> too early.
>> 
>> After commit 5ef7e26bdb ("migration/multifd: solve zero page causing
>> multiple page faults"), multifd started using the rb->receivedmap
>> bitmap, which belongs to ram.c and is initialized and *freed* from the
>> ram SaveVMHandlers.
>> 
>> Multifd threads are live until migration_incoming_state_destroy(),
>> which is called after qemu_loadvm_state_cleanup(), leading to a crash
>> when accessing rb->receivedmap.
>> 
>> process_incoming_migration_co()        ...
>>   qemu_loadvm_state()                  multifd_nocomp_recv()
>>     qemu_loadvm_state_cleanup()          ramblock_recv_bitmap_set_offset()
>>       rb->receivedmap = NULL               set_bit_atomic(..., rb->receivedmap)
>>   ...
>>   migration_incoming_state_destroy()
>>     multifd_recv_cleanup()
>>       multifd_recv_terminate_threads(NULL)
>> 
>> Move the loadvm cleanup into migration_incoming_state_destroy(), after
>> multifd_recv_cleanup() to ensure multifd thread have already exited
>> when rb->receivedmap is cleared.
>> 
>> The have_listen_thread logic can now be removed because its purpose
>> was to delay cleanup until postcopy_ram_listen_thread() had finished.
>> 
>> CC: qemu-stable@nongnu.org
>> Fixes: 5ef7e26bdb ("migration/multifd: solve zero page causing multiple page faults")
>> Signed-off-by: Fabiano Rosas <farosas@suse.de>
>> ---
>>  migration/migration.c | 1 +
>>  migration/migration.h | 1 -
>>  migration/savevm.c    | 9 ---------
>>  3 files changed, 1 insertion(+), 10 deletions(-)
>> 
>> diff --git a/migration/migration.c b/migration/migration.c
>> index 3dea06d577..b190a574b1 100644
>> --- a/migration/migration.c
>> +++ b/migration/migration.c
>> @@ -378,6 +378,7 @@ void migration_incoming_state_destroy(void)
>>      struct MigrationIncomingState *mis = migration_incoming_get_current();
>>  
>>      multifd_recv_cleanup();
>> +    qemu_loadvm_state_cleanup();
>>  
>>      if (mis->to_src_file) {
>>          /* Tell source that we are done */
>> diff --git a/migration/migration.h b/migration/migration.h
>> index 38aa1402d5..20b0a5b66e 100644
>> --- a/migration/migration.h
>> +++ b/migration/migration.h
>> @@ -101,7 +101,6 @@ struct MigrationIncomingState {
>>      /* Set this when we want the fault thread to quit */
>>      bool           fault_thread_quit;
>>  
>> -    bool           have_listen_thread;
>>      QemuThread     listen_thread;
>>  
>>      /* For the kernel to send us notifications */
>> diff --git a/migration/savevm.c b/migration/savevm.c
>> index d0759694fd..532ee5e4b0 100644
>> --- a/migration/savevm.c
>> +++ b/migration/savevm.c
>> @@ -2076,10 +2076,8 @@ static void *postcopy_ram_listen_thread(void *opaque)
>>       * got a bad migration state).
>>       */
>>      migration_incoming_state_destroy();
>> -    qemu_loadvm_state_cleanup();
>>  
>>      rcu_unregister_thread();
>> -    mis->have_listen_thread = false;
>>      postcopy_state_set(POSTCOPY_INCOMING_END);
>>  
>>      object_unref(OBJECT(migr));
>> @@ -2130,7 +2128,6 @@ static int loadvm_postcopy_handle_listen(MigrationIncomingState *mis)
>>          return -1;
>>      }
>>  
>> -    mis->have_listen_thread = true;
>>      postcopy_thread_create(mis, &mis->listen_thread, "mig/dst/listen",
>>                             postcopy_ram_listen_thread, QEMU_THREAD_DETACHED);
>>      trace_loadvm_postcopy_handle_listen("return");
>> @@ -2978,11 +2975,6 @@ int qemu_loadvm_state(QEMUFile *f)
>>  
>>      trace_qemu_loadvm_state_post_main(ret);
>>  
>> -    if (mis->have_listen_thread) {
>> -        /* Listen thread still going, can't clean up yet */
>> -        return ret;
>> -    }
>
> Hmm, I wonder whether we would still need this.  IIUC it's not only about
> cleanup, but also that when postcopy is involved, dst QEMU postpones doing
> any of the rest in the qemu_loadvm_state_main() call.
>
> E.g. cpu put, aka, cpu_synchronize_all_post_init(), is also done in
> loadvm_postcopy_handle_run_bh() later.
>
> IOW, I'd then expect when this patch applied we'll put cpu twice?
>
> I think the should_send_vmdesc() part is fine, as it returns false for
> postcopy anyway.  However not sure on the cpu post_init above.

I'm not sure either, but there's several ioctls in there, so it's
probably better to skip them. I'll keep the have_listen and adjust the
comment.

Elena Ufimtseva Sept. 20, 2024, 6:55 p.m. UTC | #3

On Fri, Sep 13, 2024 at 3:07 PM Fabiano Rosas <farosas@suse.de> wrote:

> Fix a segmentation fault in multifd when rb->receivedmap is cleared
> too early.
>
> After commit 5ef7e26bdb ("migration/multifd: solve zero page causing
> multiple page faults"), multifd started using the rb->receivedmap
> bitmap, which belongs to ram.c and is initialized and *freed* from the
> ram SaveVMHandlers.
>
> Multifd threads are live until migration_incoming_state_destroy(),
> which is called after qemu_loadvm_state_cleanup(), leading to a crash
> when accessing rb->receivedmap.
>
> process_incoming_migration_co()        ...
>   qemu_loadvm_state()                  multifd_nocomp_recv()
>     qemu_loadvm_state_cleanup()          ramblock_recv_bitmap_set_offset()
>       rb->receivedmap = NULL               set_bit_atomic(...,
> rb->receivedmap)
>   ...
>   migration_incoming_state_destroy()
>     multifd_recv_cleanup()
>       multifd_recv_terminate_threads(NULL)
>
> Move the loadvm cleanup into migration_incoming_state_destroy(), after
> multifd_recv_cleanup() to ensure multifd thread have already exited
> when rb->receivedmap is cleared.
>
> The have_listen_thread logic can now be removed because its purpose
> was to delay cleanup until postcopy_ram_listen_thread() had finished.
>
> CC: qemu-stable@nongnu.org
> Fixes: 5ef7e26bdb ("migration/multifd: solve zero page causing multiple
> page faults")
> Signed-off-by: Fabiano Rosas <farosas@suse.de>
> ---
>  migration/migration.c | 1 +
>  migration/migration.h | 1 -
>  migration/savevm.c    | 9 ---------
>  3 files changed, 1 insertion(+), 10 deletions(-)
>
> diff --git a/migration/migration.c b/migration/migration.c
> index 3dea06d577..b190a574b1 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -378,6 +378,7 @@ void migration_incoming_state_destroy(void)
>      struct MigrationIncomingState *mis = migration_incoming_get_current();
>
>      multifd_recv_cleanup();
> +    qemu_loadvm_state_cleanup();
>
>      if (mis->to_src_file) {
>          /* Tell source that we are done */
> diff --git a/migration/migration.h b/migration/migration.h
> index 38aa1402d5..20b0a5b66e 100644
> --- a/migration/migration.h
> +++ b/migration/migration.h
> @@ -101,7 +101,6 @@ struct MigrationIncomingState {
>      /* Set this when we want the fault thread to quit */
>      bool           fault_thread_quit;
>
> -    bool           have_listen_thread;
>      QemuThread     listen_thread;
>
>      /* For the kernel to send us notifications */
> diff --git a/migration/savevm.c b/migration/savevm.c
> index d0759694fd..532ee5e4b0 100644
> --- a/migration/savevm.c
> +++ b/migration/savevm.c
> @@ -2076,10 +2076,8 @@ static void *postcopy_ram_listen_thread(void
> *opaque)
>       * got a bad migration state).
>       */
>      migration_incoming_state_destroy();
> -    qemu_loadvm_state_cleanup();
>
>      rcu_unregister_thread();
> -    mis->have_listen_thread = false;
>      postcopy_state_set(POSTCOPY_INCOMING_END);
>
>      object_unref(OBJECT(migr));
> @@ -2130,7 +2128,6 @@ static int
> loadvm_postcopy_handle_listen(MigrationIncomingState *mis)
>          return -1;
>      }
>
> -    mis->have_listen_thread = true;
>      postcopy_thread_create(mis, &mis->listen_thread, "mig/dst/listen",
>                             postcopy_ram_listen_thread,
> QEMU_THREAD_DETACHED);
>      trace_loadvm_postcopy_handle_listen("return");
> @@ -2978,11 +2975,6 @@ int qemu_loadvm_state(QEMUFile *f)
>
>      trace_qemu_loadvm_state_post_main(ret);
>
> -    if (mis->have_listen_thread) {
> -        /* Listen thread still going, can't clean up yet */
> -        return ret;
> -    }
> -
>      if (ret == 0) {
>          ret = qemu_file_get_error(f);
>      }
> @@ -3022,7 +3014,6 @@ int qemu_loadvm_state(QEMUFile *f)
>          }
>      }
>
> -    qemu_loadvm_state_cleanup();
>      cpu_synchronize_all_post_init();
>


Hi Fabiano

I have a question. By removing  qemu_loadvm_state_cleanup() here, the
failure path that ends up with exit(EXIT_FAILURE)
in process_incoming_migration_co() end up not calling the
qemu_loadvm_state_cleanup(). I am not sure how this is important since
there is exit, but the
vfio, for example, will not call the VF reset.

Another more general question is why destination Qemu has to terminate
there if there was an error detected during live migration?
Could just failing the migration and leave destination running be a more
expected scenario?

Thank you!

     return ret;
> --
> 2.35.3
>
>
>

Fabiano Rosas Oct. 8, 2024, 9:36 p.m. UTC | #4

Elena Ufimtseva <ufimtseva@gmail.com> writes:

> On Fri, Sep 13, 2024 at 3:07 PM Fabiano Rosas <farosas@suse.de> wrote:
>
>> Fix a segmentation fault in multifd when rb->receivedmap is cleared
>> too early.
>>
>> After commit 5ef7e26bdb ("migration/multifd: solve zero page causing
>> multiple page faults"), multifd started using the rb->receivedmap
>> bitmap, which belongs to ram.c and is initialized and *freed* from the
>> ram SaveVMHandlers.
>>
>> Multifd threads are live until migration_incoming_state_destroy(),
>> which is called after qemu_loadvm_state_cleanup(), leading to a crash
>> when accessing rb->receivedmap.
>>
>> process_incoming_migration_co()        ...
>>   qemu_loadvm_state()                  multifd_nocomp_recv()
>>     qemu_loadvm_state_cleanup()          ramblock_recv_bitmap_set_offset()
>>       rb->receivedmap = NULL               set_bit_atomic(...,
>> rb->receivedmap)
>>   ...
>>   migration_incoming_state_destroy()
>>     multifd_recv_cleanup()
>>       multifd_recv_terminate_threads(NULL)
>>
>> Move the loadvm cleanup into migration_incoming_state_destroy(), after
>> multifd_recv_cleanup() to ensure multifd thread have already exited
>> when rb->receivedmap is cleared.
>>
>> The have_listen_thread logic can now be removed because its purpose
>> was to delay cleanup until postcopy_ram_listen_thread() had finished.
>>
>> CC: qemu-stable@nongnu.org
>> Fixes: 5ef7e26bdb ("migration/multifd: solve zero page causing multiple
>> page faults")
>> Signed-off-by: Fabiano Rosas <farosas@suse.de>
>> ---
>>  migration/migration.c | 1 +
>>  migration/migration.h | 1 -
>>  migration/savevm.c    | 9 ---------
>>  3 files changed, 1 insertion(+), 10 deletions(-)
>>
>> diff --git a/migration/migration.c b/migration/migration.c
>> index 3dea06d577..b190a574b1 100644
>> --- a/migration/migration.c
>> +++ b/migration/migration.c
>> @@ -378,6 +378,7 @@ void migration_incoming_state_destroy(void)
>>      struct MigrationIncomingState *mis = migration_incoming_get_current();
>>
>>      multifd_recv_cleanup();
>> +    qemu_loadvm_state_cleanup();
>>
>>      if (mis->to_src_file) {
>>          /* Tell source that we are done */
>> diff --git a/migration/migration.h b/migration/migration.h
>> index 38aa1402d5..20b0a5b66e 100644
>> --- a/migration/migration.h
>> +++ b/migration/migration.h
>> @@ -101,7 +101,6 @@ struct MigrationIncomingState {
>>      /* Set this when we want the fault thread to quit */
>>      bool           fault_thread_quit;
>>
>> -    bool           have_listen_thread;
>>      QemuThread     listen_thread;
>>
>>      /* For the kernel to send us notifications */
>> diff --git a/migration/savevm.c b/migration/savevm.c
>> index d0759694fd..532ee5e4b0 100644
>> --- a/migration/savevm.c
>> +++ b/migration/savevm.c
>> @@ -2076,10 +2076,8 @@ static void *postcopy_ram_listen_thread(void
>> *opaque)
>>       * got a bad migration state).
>>       */
>>      migration_incoming_state_destroy();
>> -    qemu_loadvm_state_cleanup();
>>
>>      rcu_unregister_thread();
>> -    mis->have_listen_thread = false;
>>      postcopy_state_set(POSTCOPY_INCOMING_END);
>>
>>      object_unref(OBJECT(migr));
>> @@ -2130,7 +2128,6 @@ static int
>> loadvm_postcopy_handle_listen(MigrationIncomingState *mis)
>>          return -1;
>>      }
>>
>> -    mis->have_listen_thread = true;
>>      postcopy_thread_create(mis, &mis->listen_thread, "mig/dst/listen",
>>                             postcopy_ram_listen_thread,
>> QEMU_THREAD_DETACHED);
>>      trace_loadvm_postcopy_handle_listen("return");
>> @@ -2978,11 +2975,6 @@ int qemu_loadvm_state(QEMUFile *f)
>>
>>      trace_qemu_loadvm_state_post_main(ret);
>>
>> -    if (mis->have_listen_thread) {
>> -        /* Listen thread still going, can't clean up yet */
>> -        return ret;
>> -    }
>> -
>>      if (ret == 0) {
>>          ret = qemu_file_get_error(f);
>>      }
>> @@ -3022,7 +3014,6 @@ int qemu_loadvm_state(QEMUFile *f)
>>          }
>>      }
>>
>> -    qemu_loadvm_state_cleanup();
>>      cpu_synchronize_all_post_init();
>>
>
>
> Hi Fabiano

Hi, sorry for the delay.

>
> I have a question. By removing  qemu_loadvm_state_cleanup() here, the
> failure path that ends up with exit(EXIT_FAILURE)
> in process_incoming_migration_co() end up not calling the
> qemu_loadvm_state_cleanup(). I am not sure how this is important since
> there is exit, but the
> vfio, for example, will not call the VF reset.

In the fail label, migration_incoming_state_destroy() is called right
before the exit block.

>
> Another more general question is why destination Qemu has to terminate
> there if there was an error detected during live migration?
> Could just failing the migration and leave destination running be a more
> expected scenario?

After failure to load, the destination VM is not really usable, so
terminating it seems ok. This was changed with the addition of
mis->exit_on_error in dbea1c89da ("qapi: introduce exit-on-error
parameter for migrate-incoming"). We still exit on error by default, but
allow the user to modify the behavior to allow management applications
to collect the error or inspect the VM.

>
> Thank you!
>
>      return ret;
>> --
>> 2.35.3
>>
>>
>>

diff --git a/migration/migration.c b/migration/migration.c
index 3dea06d577..b190a574b1 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -378,6 +378,7 @@  void migration_incoming_state_destroy(void)
     struct MigrationIncomingState *mis = migration_incoming_get_current();
 
     multifd_recv_cleanup();
+    qemu_loadvm_state_cleanup();
 
     if (mis->to_src_file) {
         /* Tell source that we are done */
diff --git a/migration/migration.h b/migration/migration.h
index 38aa1402d5..20b0a5b66e 100644
--- a/migration/migration.h
+++ b/migration/migration.h
@@ -101,7 +101,6 @@  struct MigrationIncomingState {
     /* Set this when we want the fault thread to quit */
     bool           fault_thread_quit;
 
-    bool           have_listen_thread;
     QemuThread     listen_thread;
 
     /* For the kernel to send us notifications */
diff --git a/migration/savevm.c b/migration/savevm.c
index d0759694fd..532ee5e4b0 100644
--- a/migration/savevm.c
+++ b/migration/savevm.c
@@ -2076,10 +2076,8 @@  static void *postcopy_ram_listen_thread(void *opaque)
      * got a bad migration state).
      */
     migration_incoming_state_destroy();
-    qemu_loadvm_state_cleanup();
 
     rcu_unregister_thread();
-    mis->have_listen_thread = false;
     postcopy_state_set(POSTCOPY_INCOMING_END);
 
     object_unref(OBJECT(migr));
@@ -2130,7 +2128,6 @@  static int loadvm_postcopy_handle_listen(MigrationIncomingState *mis)
         return -1;
     }
 
-    mis->have_listen_thread = true;
     postcopy_thread_create(mis, &mis->listen_thread, "mig/dst/listen",
                            postcopy_ram_listen_thread, QEMU_THREAD_DETACHED);
     trace_loadvm_postcopy_handle_listen("return");
@@ -2978,11 +2975,6 @@  int qemu_loadvm_state(QEMUFile *f)
 
     trace_qemu_loadvm_state_post_main(ret);
 
-    if (mis->have_listen_thread) {
-        /* Listen thread still going, can't clean up yet */
-        return ret;
-    }
-
     if (ret == 0) {
         ret = qemu_file_get_error(f);
     }
@@ -3022,7 +3014,6 @@  int qemu_loadvm_state(QEMUFile *f)
         }
     }
 
-    qemu_loadvm_state_cleanup();
     cpu_synchronize_all_post_init();
 
     return ret;

[2/2] migration/multifd: Fix rb->receivedmap cleanup race

Commit Message

Comments

Patch