
[RFC,2/2] migration/multifd: Move semaphore release into main thread

Message ID 20231109165856.15224-3-farosas@suse.de (mailing list archive)
State New, archived
Series migration: Fix multifd qemu_mutex_destroy race

Commit Message

Fabiano Rosas Nov. 9, 2023, 4:58 p.m. UTC
We cannot operate on the multifd semaphores outside of the multifd
channel thread because multifd_save_cleanup() can run in parallel and
attempt to destroy the mutexes, which causes an assert.

Looking at the places where we use the semaphores aside from the
migration thread, there's only the TLS handshake thread and the
initial multifd_channel_connect() in the main thread. These are places
where creating the multifd channel cannot fail, so releasing the
semaphores at these places only serves the purpose of avoiding a
deadlock when an error happens before the channel(s) have started, but
after the migration thread has already called
multifd_send_sync_main().

Instead of attempting to release the semaphores at those places, move
the release into multifd_save_cleanup(). This puts the semaphore usage
in the same thread that does the cleanup, eliminating the race.

Move the call to multifd_save_cleanup() to before the migration thread
join, so the semaphores are released first.

Signed-off-by: Fabiano Rosas <farosas@suse.de>
---
 migration/migration.c |  4 +++-
 migration/multifd.c   | 29 +++++++++++------------------
 2 files changed, 14 insertions(+), 19 deletions(-)

Comments

Peter Xu Nov. 9, 2023, 6:56 p.m. UTC | #1
On Thu, Nov 09, 2023 at 01:58:56PM -0300, Fabiano Rosas wrote:
> We cannot operate on the multifd semaphores outside of the multifd
> channel thread
> because multifd_save_cleanup() can run in parallel and
> attempt to destroy the mutexes, which causes an assert.
> 
> Looking at the places where we use the semaphores aside from the
> migration thread, there's only the TLS handshake thread and the
> initial multifd_channel_connect() in the main thread. These are places
> where creating the multifd channel cannot fail, so releasing the
> semaphores at these places only serves the purpose of avoiding a
> deadlock when an error happens before the channel(s) have started, but
> after the migration thread has already called
> multifd_send_sync_main().
> 
> Instead of attempting to release the semaphores at those places, move
> the release into multifd_save_cleanup(). This puts the semaphore usage
> in the same thread that does the cleanup, eliminating the race.
> 
> Move the call to multifd_save_cleanup() before joining for the
> migration thread so we release the semaphores before.
> 
> Signed-off-by: Fabiano Rosas <farosas@suse.de>
> ---
>  migration/migration.c |  4 +++-
>  migration/multifd.c   | 29 +++++++++++------------------
>  2 files changed, 14 insertions(+), 19 deletions(-)
> 
> diff --git a/migration/migration.c b/migration/migration.c
> index cca32c553c..52be20561b 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -1300,6 +1300,9 @@ static void migrate_fd_cleanup(MigrationState *s)
>          QEMUFile *tmp;
>  
>          trace_migrate_fd_cleanup();
> +
> +        multifd_save_cleanup();
> +
>          qemu_mutex_unlock_iothread();
>          if (s->migration_thread_running) {
>              qemu_thread_join(&s->thread);
> @@ -1307,7 +1310,6 @@ static void migrate_fd_cleanup(MigrationState *s)
>          }
>          qemu_mutex_lock_iothread();
>  
> -        multifd_save_cleanup();
>          qemu_mutex_lock(&s->qemu_file_lock);
>          tmp = s->to_dst_file;
>          s->to_dst_file = NULL;
> diff --git a/migration/multifd.c b/migration/multifd.c
> index ec58c58082..9ca482cfbe 100644
> --- a/migration/multifd.c
> +++ b/migration/multifd.c
> @@ -527,6 +527,9 @@ void multifd_save_cleanup(void)
>  
>          if (p->running) {
>              qemu_thread_join(&p->thread);
> +        } else {
> +            qemu_sem_post(&p->sem_sync);
> +            qemu_sem_post(&multifd_send_state->channels_ready);

I think relying on p->running to join the thread is already problematic.

Now all threads are created JOINABLE, so we must join them to release
the thread resources.  Clearing "running" at the end of the thread already
looks wrong to me, because it means that if the thread quits before the main
thread reaches here, we will not join the thread, and its resources
will be leaked.

Here IMHO we should set p->running=true right before the thread is created,
and never clear it.  We may even want to rename it to p->thread_created?
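
A minimal sketch of that pattern in plain pthreads (not QEMU code; the
Channel struct and the names are only illustrative): the flag is set at
creation time, never cleared by the thread itself, and cleanup joins
unconditionally whenever the flag is set.

#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct {
    pthread_t thread;
    bool thread_created;   /* set at creation time, never cleared by the thread */
} Channel;

static void *channel_thread(void *opaque)
{
    /* ... do work; may return early on error ... */
    return NULL;
}

static void channel_start(Channel *c)
{
    if (pthread_create(&c->thread, NULL, channel_thread, c) == 0) {
        c->thread_created = true;
    }
}

static void channel_cleanup(Channel *c)
{
    if (c->thread_created) {
        /*
         * Join even if the thread has already returned: a JOINABLE
         * thread's resources are only released by the join.
         */
        pthread_join(c->thread, NULL);
        c->thread_created = false;   /* avoid joining twice */
    }
}

int main(void)
{
    Channel c = { 0 };

    channel_start(&c);
    channel_cleanup(&c);
    return 0;
}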

>          }
>      }
>      for (i = 0; i < migrate_multifd_channels(); i++) {
> @@ -751,8 +754,6 @@ out:
>          assert(local_err);
>          trace_multifd_send_error(p->id);
>          multifd_send_terminate_threads(local_err);
> -        qemu_sem_post(&p->sem_sync);
> -        qemu_sem_post(&multifd_send_state->channels_ready);
>          error_free(local_err);
>      }
>  
> @@ -780,20 +781,15 @@ static void multifd_tls_outgoing_handshake(QIOTask *task,
>  
>      if (!qio_task_propagate_error(task, &err)) {
>          trace_multifd_tls_outgoing_handshake_complete(ioc);
> -        if (multifd_channel_connect(p, ioc, &err)) {
> -            return;
> -        }
> +        multifd_channel_connect(p, ioc, NULL);

Ignoring Error** is not good..

I think you meant to say "it should never fail"; in that case we should pass
&error_abort.  Another, cleaner way to do this is to split the current
multifd_channel_connect() into tls and non-tls helpers, then we can call
the non-tls helper here (which may not need an Error**).

> +    } else {
> +        /*
> +         * The multifd client could already be waiting to queue data,
> +         * so let it know that we didn't even start.
> +         */
> +        p->quit = true;
> +        trace_multifd_tls_outgoing_handshake_error(ioc, error_get_pretty(err));
>      }
> -
> -    trace_multifd_tls_outgoing_handshake_error(ioc, error_get_pretty(err));
> -
> -    /*
> -     * Error happen, mark multifd_send_thread status as 'quit' although it
> -     * is not created, and then tell who pay attention to me.
> -     */
> -    p->quit = true;
> -    qemu_sem_post(&multifd_send_state->channels_ready);
> -    qemu_sem_post(&p->sem_sync);
>  }
>  
>  static void *multifd_tls_handshake_thread(void *opaque)
> @@ -862,9 +858,6 @@ static void multifd_new_send_channel_cleanup(MultiFDSendParams *p,
>                                               QIOChannel *ioc, Error *err)
>  {
>       migrate_set_error(migrate_get_current(), err);
> -     /* Error happen, we need to tell who pay attention to me */
> -     qemu_sem_post(&multifd_send_state->channels_ready);
> -     qemu_sem_post(&p->sem_sync);
>       /*
>        * Although multifd_send_thread is not created, but main migration
>        * thread need to judge whether it is running, so we need to mark
> -- 

I may still need some more time to digest your whole solution; it's currently
not very clear to me.  It may or may not be the patch's problem, though.

But let me also share how I see this..  I think we should rework the
multifd thread model on channel establishment.

Firstly, as I mentioned above, we must always join() a thread if it's
JOINABLE or its resources are leaked, afaict.  That's the first thing to fix.

Then let's look at what we do when TLS is in use: we'll create "two" threads
just for that, and what's even worse:

  - We create the tls handshake thread in multifd_tls_channel_connect()
    first, setting &p->thread.

  - Within the same thread, when the handshake is done we run
    multifd_tls_outgoing_handshake() -> multifd_channel_connect() -> which
    then creates the real multifd_send_thread(), setting &p->thread too.

So AFAICT, the tls handshake thread is already leaked: p->thread gets
overwritten by the new thread that it itself creates..

I think we should fix this, then.  I haven't figured out the best way to
do it; two things I can come up with now:

  1) At least make the tls handshake thread detached (see the sketch
     right after this list).

  2) Make the handshake done in multifd send thread directly; I don't yet
     see why we must create two threads..
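
A self-contained sketch of option 1 in plain pthreads (not QEMU code; the
in-tree equivalent would be qemu_thread_create() with QEMU_THREAD_DETACHED,
and the handshake body here is just a stand-in):

#include <pthread.h>

static void *handshake_thread(void *opaque)
{
    /* ... TLS handshake work would go here ... */
    return NULL;
}

int main(void)
{
    pthread_attr_t attr;
    pthread_t tid;

    pthread_attr_init(&attr);
    pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_DETACHED);

    /*
     * A detached thread releases its own resources when it exits and
     * must never be joined, so cleanup does not have to track it.
     */
    pthread_create(&tid, &attr, handshake_thread, NULL);
    pthread_attr_destroy(&attr);

    pthread_exit(NULL);   /* let the detached thread finish before the process exits */
}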

Then, assuming we have a clear model with all these thread issues fixed (no
matter whether we'd shrink 2N threads into N threads), what we need to do,
IMHO, is make sure to join() all of them before destroying anything (say,
the per-channel MultiFDSendParams).  Then we can destroy everything safely,
mutexes/sems/etc., because no one will race with us anymore.
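
A tiny standalone illustration of that ordering with plain POSIX primitives
(not QEMU code; the per-channel work is made up): every thread is joined
first, and only then are the mutex and semaphore destroyed.

#include <pthread.h>
#include <semaphore.h>
#include <stdio.h>

#define NCHANNELS 2

static pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;
static sem_t channels_ready;

static void *channel_thread(void *opaque)
{
    pthread_mutex_lock(&mutex);
    /* ... send one batch of pages ... */
    pthread_mutex_unlock(&mutex);
    sem_post(&channels_ready);
    return NULL;
}

int main(void)
{
    pthread_t threads[NCHANNELS];
    int i;

    sem_init(&channels_ready, 0, 0);
    for (i = 0; i < NCHANNELS; i++) {
        pthread_create(&threads[i], NULL, channel_thread, NULL);
    }

    /* Join every thread first: after this, nobody can race with cleanup. */
    for (i = 0; i < NCHANNELS; i++) {
        pthread_join(threads[i], NULL);
    }

    /* Only now is it safe to destroy the primitives the threads used. */
    pthread_mutex_destroy(&mutex);
    sem_destroy(&channels_ready);

    printf("cleaned up after %d channels\n", NCHANNELS);
    return 0;
}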

Would it make sense?

Thanks,
Fabiano Rosas Nov. 10, 2023, 12:05 p.m. UTC | #2
Peter Xu <peterx@redhat.com> writes:

> On Thu, Nov 09, 2023 at 01:58:56PM -0300, Fabiano Rosas wrote:
>> We cannot operate on the multifd semaphores outside of the multifd
>> channel thread
>> because multifd_save_cleanup() can run in parallel and
>> attempt to destroy the mutexes, which causes an assert.
>> 
>> Looking at the places where we use the semaphores aside from the
>> migration thread, there's only the TLS handshake thread and the
>> initial multifd_channel_connect() in the main thread. These are places
>> where creating the multifd channel cannot fail, so releasing the
>> semaphores at these places only serves the purpose of avoiding a
>> deadlock when an error happens before the channel(s) have started, but
>> after the migration thread has already called
>> multifd_send_sync_main().
>> 
>> Instead of attempting to release the semaphores at those places, move
>> the release into multifd_save_cleanup(). This puts the semaphore usage
>> in the same thread that does the cleanup, eliminating the race.
>> 
>> Move the call to multifd_save_cleanup() before joining for the
>> migration thread so we release the semaphores before.
>> 
>> Signed-off-by: Fabiano Rosas <farosas@suse.de>
>> ---
>>  migration/migration.c |  4 +++-
>>  migration/multifd.c   | 29 +++++++++++------------------
>>  2 files changed, 14 insertions(+), 19 deletions(-)
>> 
>> diff --git a/migration/migration.c b/migration/migration.c
>> index cca32c553c..52be20561b 100644
>> --- a/migration/migration.c
>> +++ b/migration/migration.c
>> @@ -1300,6 +1300,9 @@ static void migrate_fd_cleanup(MigrationState *s)
>>          QEMUFile *tmp;
>>  
>>          trace_migrate_fd_cleanup();
>> +
>> +        multifd_save_cleanup();
>> +
>>          qemu_mutex_unlock_iothread();
>>          if (s->migration_thread_running) {
>>              qemu_thread_join(&s->thread);
>> @@ -1307,7 +1310,6 @@ static void migrate_fd_cleanup(MigrationState *s)
>>          }
>>          qemu_mutex_lock_iothread();
>>  
>> -        multifd_save_cleanup();
>>          qemu_mutex_lock(&s->qemu_file_lock);
>>          tmp = s->to_dst_file;
>>          s->to_dst_file = NULL;
>> diff --git a/migration/multifd.c b/migration/multifd.c
>> index ec58c58082..9ca482cfbe 100644
>> --- a/migration/multifd.c
>> +++ b/migration/multifd.c
>> @@ -527,6 +527,9 @@ void multifd_save_cleanup(void)
>>  
>>          if (p->running) {
>>              qemu_thread_join(&p->thread);
>> +        } else {
>> +            qemu_sem_post(&p->sem_sync);
>> +            qemu_sem_post(&multifd_send_state->channels_ready);
>
> I think relying on p->running to join the thread is already problematic.
>

Absolutely. I've already complained about this in another thread.

> Now all threads are created with JOINABLE, so we must join them to release
> the thread resources.  Clearing "running" at the end of the thread is
> already wrong to me, because it means if the thread quits before the main
> thread reaching here, we will not join the thread, and the thread resource
> will be leaked.
>
> Here IMHO we should set p->running=true right before thread created, and
> never clear it.  We may even want to rename it to p->thread_created?
>

Ok, let me do that. I already have some patches touching
multifd_new_send_channel_async() due to fixed-ram.

>>          }
>>      }
>>      for (i = 0; i < migrate_multifd_channels(); i++) {
>> @@ -751,8 +754,6 @@ out:
>>          assert(local_err);
>>          trace_multifd_send_error(p->id);
>>          multifd_send_terminate_threads(local_err);
>> -        qemu_sem_post(&p->sem_sync);
>> -        qemu_sem_post(&multifd_send_state->channels_ready);
>>          error_free(local_err);
>>      }
>>  
>> @@ -780,20 +781,15 @@ static void multifd_tls_outgoing_handshake(QIOTask *task,
>>  
>>      if (!qio_task_propagate_error(task, &err)) {
>>          trace_multifd_tls_outgoing_handshake_complete(ioc);
>> -        if (multifd_channel_connect(p, ioc, &err)) {
>> -            return;
>> -        }
>> +        multifd_channel_connect(p, ioc, NULL);
>
> Ignoring Error** is not good..
>
> I think you meant to say "it should never fail", then we should put
> &error_abort.  Another cleaner way to do this is split the current
> multifd_channel_connect() into tls and non-tls helpers, then we can call
> the non-tls helpers here (which may not need an Error**).

I attempted the split and it looked awkward, so I let it go. But I agree
about the Error.

>> +    } else {
>> +        /*
>> +         * The multifd client could already be waiting to queue data,
>> +         * so let it know that we didn't even start.
>> +         */
>> +        p->quit = true;
>> +        trace_multifd_tls_outgoing_handshake_error(ioc, error_get_pretty(err));
>>      }
>> -
>> -    trace_multifd_tls_outgoing_handshake_error(ioc, error_get_pretty(err));
>> -
>> -    /*
>> -     * Error happen, mark multifd_send_thread status as 'quit' although it
>> -     * is not created, and then tell who pay attention to me.
>> -     */
>> -    p->quit = true;
>> -    qemu_sem_post(&multifd_send_state->channels_ready);
>> -    qemu_sem_post(&p->sem_sync);
>>  }
>>  
>>  static void *multifd_tls_handshake_thread(void *opaque)
>> @@ -862,9 +858,6 @@ static void multifd_new_send_channel_cleanup(MultiFDSendParams *p,
>>                                               QIOChannel *ioc, Error *err)
>>  {
>>       migrate_set_error(migrate_get_current(), err);
>> -     /* Error happen, we need to tell who pay attention to me */
>> -     qemu_sem_post(&multifd_send_state->channels_ready);
>> -     qemu_sem_post(&p->sem_sync);
>>       /*
>>        * Although multifd_send_thread is not created, but main migration
>>        * thread need to judge whether it is running, so we need to mark
>> -- 
>
> I may still need some more time to digest your whole solution, currently
> not very clear to me.  It may or may not be the patch's problem,
> though.

The core assumption of this patch is that we currently only need to
release the semaphores because multifd_send_sync_main() is allowed
to execute even before the multifd channels have started. If channel
creation fails before it even starts, for instance because of a TLS
handshake failure, the migration thread will hang waiting for
channels_ready (or sem_sync).

Then there are two premises:

 - when an error happens, multifd_save_cleanup() is always called;
 
 - a hanging migration thread (at multifd_send_sync_main) will not stop
   the main thread from reaching multifd_save_cleanup.

If this holds, then it's safe to release the semaphores right before
destroying them at multifd_save_cleanup.

> But let me also share how I see this..  I think we should rework the
> multifd thread model on channel establishment.
>
> Firstly, as I mentioned above, we must always join() the threads if it's
> JOINABLE or the resource is leaked, afaict.  That's the first thing to fix.

I think we historically stumbled upon the fact that qemu_thread_join()
is not the same as pthread_join(). The former takes a pointer and is not
safe to call with a NULL QemuThread. That seems to be the reason for the
p->running check before it.

> Then let's see when TLS is there and what we do: we'll create "two" threads
> just for that, what's even worse:
>
>   - We'll create tls handshake thread in multifd_tls_channel_connect()
>     first, setting &p->thread.
>
>   - Within the same thread, we do multifd_tls_outgoing_handshake() when
>     handshake done -> multifd_channel_connect() -> we yet create the real
>     multifd_send_thread(), setting &p->thread too.

Hmm, good point: we're calling qio_task_complete() from within the
thread, which is misleading. I believe using qio_task_run_in_thread()
would put the completion callback in the main thread, right? That would
look a bit better I think because we could then join the handshake
thread before multifd_channel_connect().

> So AFAICT, the tls handshake thread is already leaked, got p->thread
> overwritten by the new thread created by itself..
>
> I think we should fix this then.  I haven't figured the best way to do,
> two things I can come up with now:
>
>   1) At least make the tls handshake thread detached.
>
>   2) Make the handshake done in multifd send thread directly; I don't yet
>      see why we must create two threads..

I'm (a little bit) against this. It's nice to know that a multifdsend_#
thread will only be doing IO and no other task. I have the same concern
about putting zero page detection in the multifd thread.

> Then assuming we have a clear model with all these threads issue fixed (no
> matter whether we'd shrink 2N threads into N threads), then what we need to
> do, IMHO, is making sure to join() all of them before destroying anything
> (say, per-channel MultiFDSendParams).  Then when we destroy everything
> safely, either mutex/sem/etc..  Because no one will race us anymore.

This doesn't address the race. There's a data dependency between the
multifd channels and the migration thread around the channels_ready
semaphore. So we cannot join the migration thread because it could be
stuck waiting for the semaphore, which means we cannot join+cleanup the
channel thread because the semaphore is still being used. That's why I'm
moving the semaphore posts to the point where we join the
thread. Reading your email, I now think that should be:

qemu_sem_post(&p->sem_sync);
qemu_sem_post(&multifd_send_state->channels_ready);
qemu_thread_join(&p->thread);

Fundamentally, the issue is that we do create the multifd threads before
the migration thread, but they might not be ready before the migration
thread needs to use them.
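
To make the dependency concrete, a self-contained toy in plain POSIX
threads/semaphores (not QEMU code): the stand-in "migration thread" is
parked on channels_ready, so cleanup has to post before it can join and
destroy anything; swapping the post and the join deadlocks.

#include <pthread.h>
#include <semaphore.h>
#include <stdio.h>

static sem_t channels_ready;

/* Stands in for the migration thread stuck in multifd_send_sync_main(). */
static void *migration_thread(void *opaque)
{
    sem_wait(&channels_ready);   /* hangs forever unless someone posts */
    return NULL;
}

int main(void)
{
    pthread_t mig;

    sem_init(&channels_ready, 0, 0);
    pthread_create(&mig, NULL, migration_thread, NULL);

    /*
     * Pretend the multifd channel failed before ever posting.  Cleanup
     * must post first, then join, then destroy; joining first would
     * deadlock, and destroying first would tear down a semaphore that
     * is still being waited on.
     */
    sem_post(&channels_ready);
    pthread_join(mig, NULL);
    sem_destroy(&channels_ready);

    printf("clean shutdown\n");
    return 0;
}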
Fabiano Rosas Nov. 10, 2023, 12:37 p.m. UTC | #3
Fabiano Rosas <farosas@suse.de> writes:

> Peter Xu <peterx@redhat.com> writes:
>
>> On Thu, Nov 09, 2023 at 01:58:56PM -0300, Fabiano Rosas wrote:
>>> We cannot operate on the multifd semaphores outside of the multifd
>>> channel thread
>>> because multifd_save_cleanup() can run in parallel and
>>> attempt to destroy the mutexes, which causes an assert.
>>> 
>>> Looking at the places where we use the semaphores aside from the
>>> migration thread, there's only the TLS handshake thread and the
>>> initial multifd_channel_connect() in the main thread. These are places
>>> where creating the multifd channel cannot fail, so releasing the
>>> semaphores at these places only serves the purpose of avoiding a
>>> deadlock when an error happens before the channel(s) have started, but
>>> after the migration thread has already called
>>> multifd_send_sync_main().
>>> 
>>> Instead of attempting to release the semaphores at those places, move
>>> the release into multifd_save_cleanup(). This puts the semaphore usage
>>> in the same thread that does the cleanup, eliminating the race.
>>> 
>>> Move the call to multifd_save_cleanup() before joining for the
>>> migration thread so we release the semaphores before.
>>> 
>>> Signed-off-by: Fabiano Rosas <farosas@suse.de>
>>> ---
>>>  migration/migration.c |  4 +++-
>>>  migration/multifd.c   | 29 +++++++++++------------------
>>>  2 files changed, 14 insertions(+), 19 deletions(-)
>>> 
>>> diff --git a/migration/migration.c b/migration/migration.c
>>> index cca32c553c..52be20561b 100644
>>> --- a/migration/migration.c
>>> +++ b/migration/migration.c
>>> @@ -1300,6 +1300,9 @@ static void migrate_fd_cleanup(MigrationState *s)
>>>          QEMUFile *tmp;
>>>  
>>>          trace_migrate_fd_cleanup();
>>> +
>>> +        multifd_save_cleanup();
>>> +
>>>          qemu_mutex_unlock_iothread();
>>>          if (s->migration_thread_running) {
>>>              qemu_thread_join(&s->thread);
>>> @@ -1307,7 +1310,6 @@ static void migrate_fd_cleanup(MigrationState *s)
>>>          }
>>>          qemu_mutex_lock_iothread();
>>>  
>>> -        multifd_save_cleanup();
>>>          qemu_mutex_lock(&s->qemu_file_lock);
>>>          tmp = s->to_dst_file;
>>>          s->to_dst_file = NULL;
>>> diff --git a/migration/multifd.c b/migration/multifd.c
>>> index ec58c58082..9ca482cfbe 100644
>>> --- a/migration/multifd.c
>>> +++ b/migration/multifd.c
>>> @@ -527,6 +527,9 @@ void multifd_save_cleanup(void)
>>>  
>>>          if (p->running) {
>>>              qemu_thread_join(&p->thread);
>>> +        } else {
>>> +            qemu_sem_post(&p->sem_sync);
>>> +            qemu_sem_post(&multifd_send_state->channels_ready);
>>
>> I think relying on p->running to join the thread is already problematic.
>>
>
> Absolutely. I've already complained about this in another thread.
>
>> Now all threads are created with JOINABLE, so we must join them to release
>> the thread resources.  Clearing "running" at the end of the thread is
>> already wrong to me, because it means if the thread quits before the main
>> thread reaching here, we will not join the thread, and the thread resource
>> will be leaked.
>>
>> Here IMHO we should set p->running=true right before thread created, and
>> never clear it.  We may even want to rename it to p->thread_created?
>>
>
> Ok, let me do that. I already have some patches touching
> multifd_new_send_channel_async() due to fixed-ram.
>
>>>          }
>>>      }
>>>      for (i = 0; i < migrate_multifd_channels(); i++) {
>>> @@ -751,8 +754,6 @@ out:
>>>          assert(local_err);
>>>          trace_multifd_send_error(p->id);
>>>          multifd_send_terminate_threads(local_err);
>>> -        qemu_sem_post(&p->sem_sync);
>>> -        qemu_sem_post(&multifd_send_state->channels_ready);
>>>          error_free(local_err);
>>>      }
>>>  
>>> @@ -780,20 +781,15 @@ static void multifd_tls_outgoing_handshake(QIOTask *task,
>>>  
>>>      if (!qio_task_propagate_error(task, &err)) {
>>>          trace_multifd_tls_outgoing_handshake_complete(ioc);
>>> -        if (multifd_channel_connect(p, ioc, &err)) {
>>> -            return;
>>> -        }
>>> +        multifd_channel_connect(p, ioc, NULL);
>>
>> Ignoring Error** is not good..
>>
>> I think you meant to say "it should never fail", then we should put
>> &error_abort.  Another cleaner way to do this is split the current
>> multifd_channel_connect() into tls and non-tls helpers, then we can call
>> the non-tls helpers here (which may not need an Error**).
>
> I attempted the split and it looked awkward, so I let go. But I agree
> about the Error.
>
>>> +    } else {
>>> +        /*
>>> +         * The multifd client could already be waiting to queue data,
>>> +         * so let it know that we didn't even start.
>>> +         */
>>> +        p->quit = true;
>>> +        trace_multifd_tls_outgoing_handshake_error(ioc, error_get_pretty(err));
>>>      }
>>> -
>>> -    trace_multifd_tls_outgoing_handshake_error(ioc, error_get_pretty(err));
>>> -
>>> -    /*
>>> -     * Error happen, mark multifd_send_thread status as 'quit' although it
>>> -     * is not created, and then tell who pay attention to me.
>>> -     */
>>> -    p->quit = true;
>>> -    qemu_sem_post(&multifd_send_state->channels_ready);
>>> -    qemu_sem_post(&p->sem_sync);
>>>  }
>>>  
>>>  static void *multifd_tls_handshake_thread(void *opaque)
>>> @@ -862,9 +858,6 @@ static void multifd_new_send_channel_cleanup(MultiFDSendParams *p,
>>>                                               QIOChannel *ioc, Error *err)
>>>  {
>>>       migrate_set_error(migrate_get_current(), err);
>>> -     /* Error happen, we need to tell who pay attention to me */
>>> -     qemu_sem_post(&multifd_send_state->channels_ready);
>>> -     qemu_sem_post(&p->sem_sync);
>>>       /*
>>>        * Although multifd_send_thread is not created, but main migration
>>>        * thread need to judge whether it is running, so we need to mark
>>> -- 
>>
>> I may still need some more time to digest your whole solution, currently
>> not very clear to me.  It may or may not be the patch's problem,
>> though.
>
> The core assumption of this patch is that we currently only need to
> release the semaphores because the multifd_send_sync_main() is allowed
> to execute even before the multifd channels have started. If creation
> failed to even start, for instance because of a TLS handshake failure,
> the migration thread will hang waiting for channels_ready (or sem_sync).
>
> Then there are two premises:
>
>  - when an error happens, multifd_save_cleanup() is always called;
>  
>  - a hanging migration thread (at multifd_send_sync_main) will not stop
>    the main thread from reaching multifd_save_cleanup.
>
> If this holds, then it's safe to release the semaphores right before
> destroying them at multifd_save_cleanup.
>
>> But let me also share how I see this..  I think we should rework the
>> multifd thread model on channel establishment.
>>
>> Firstly, as I mentioned above, we must always join() the threads if it's
>> JOINABLE or the resource is leaked, afaict.  That's the first thing to fix.
>
> I think we historically stumbled upon the fact that qemu_thread_join()
> is not the same as pthread_join(). The former takes a pointer and is not
> safe to call with a NULL QemuThread. That seems to be the reason for the
> p->running check before it.

Scratch this part, the QemuThread is not a pointer.

...should it be? Because then we can test p->thread instead of
p->running, which would be more precise and would dispense with the
thread_created flag.

>> Then let's see when TLS is there and what we do: we'll create "two" threads
>> just for that, what's even worse:
>>
>>   - We'll create tls handshake thread in multifd_tls_channel_connect()
>>     first, setting &p->thread.
>>
>>   - Within the same thread, we do multifd_tls_outgoing_handshake() when
>>     handshake done -> multifd_channel_connect() -> we yet create the real
>>     multifd_send_thread(), setting &p->thread too.
>
> Hmm good point, we're calling qio_task_complete() from within the
> thread, that's misleading. I believe using qio_task_run_in_thread()
> would put the completion callback in the main thread, right? That would
> look a bit better I think because we could then join the handshake
> thread before multifd_channel_connect().
>
>> So AFAICT, the tls handshake thread is already leaked, got p->thread
>> overwritten by the new thread created by itself..
>>
>> I think we should fix this then.  I haven't figured the best way to do,
>> two things I can come up with now:
>>
>>   1) At least make the tls handshake thread detached.
>>
>>   2) Make the handshake done in multifd send thread directly; I don't yet
>>      see why we must create two threads..
>
> I'm (a little bit) against this. It's nice to know that a multifdsend_#
> thread will only be doing IO and no other task. I have the same concern
> about putting zero page detection in the multifd thread.
>
>> Then assuming we have a clear model with all these threads issue fixed (no
>> matter whether we'd shrink 2N threads into N threads), then what we need to
>> do, IMHO, is making sure to join() all of them before destroying anything
>> (say, per-channel MultiFDSendParams).  Then when we destroy everything
>> safely, either mutex/sem/etc..  Because no one will race us anymore.
>
> This doesn't address the race. There's a data dependency between the
> multifd channels and the migration thread around the channels_ready
> semaphore. So we cannot join the migration thread because it could be
> stuck waiting for the semaphore, which means we cannot join+cleanup the
> channel thread because the semaphore is still being used. That's why I'm
> moving the semaphores post to the point where we join the
> thread. Reading your email, now I think that should be:
>
> qemu_sem_post(&p->sem_sync);
> qemu_sem_post(&multifd_send_state->channels_ready);
> qemu_thread_join(&p->thread);
>
> Fundamentally, the issue is that we do create the multifd threads before
> the migration thread, but they might not be ready before the migration
> thread needs to use them.
Peter Xu Nov. 13, 2023, 4:45 p.m. UTC | #4
On Fri, Nov 10, 2023 at 09:05:41AM -0300, Fabiano Rosas wrote:

[...]

> > Then assuming we have a clear model with all these threads issue fixed (no
> > matter whether we'd shrink 2N threads into N threads), then what we need to
> > do, IMHO, is making sure to join() all of them before destroying anything
> > (say, per-channel MultiFDSendParams).  Then when we destroy everything
> > safely, either mutex/sem/etc..  Because no one will race us anymore.
> 
> This doesn't address the race. There's a data dependency between the
> multifd channels and the migration thread around the channels_ready
> semaphore. So we cannot join the migration thread because it could be
> stuck waiting for the semaphore, which means we cannot join+cleanup the
> channel thread because the semaphore is still being used.

I think this is the major source of confusion: why this can happen at all.

The problem is afaik multifd_save_cleanup() is only called by
migrate_fd_cleanup(), which is further only called in:

  1) migrate_fd_cleanup_bh()
  2) migrate_fd_connect()

For 1): it's only run when migration completes/fails/etc. (in all cases,
right before it quits..) and then kicks off migrate_fd_cleanup_schedule().
So the migration thread shouldn't be stuck, afaiu, or it wouldn't be able to
kick that BH.

For 2): it's called by the main thread, where the migration thread should
not yet have been created.

With that, I don't see how migrate_fd_cleanup() would need to worry about
the migration thread.

Did I miss something?

Thanks,
Fabiano Rosas Nov. 14, 2023, 1:50 a.m. UTC | #5
Peter Xu <peterx@redhat.com> writes:

> On Fri, Nov 10, 2023 at 09:05:41AM -0300, Fabiano Rosas wrote:
>
> [...]
>
>> > Then assuming we have a clear model with all these threads issue fixed (no
>> > matter whether we'd shrink 2N threads into N threads), then what we need to
>> > do, IMHO, is making sure to join() all of them before destroying anything
>> > (say, per-channel MultiFDSendParams).  Then when we destroy everything
>> > safely, either mutex/sem/etc..  Because no one will race us anymore.
>> 
>> This doesn't address the race. There's a data dependency between the
>> multifd channels and the migration thread around the channels_ready
>> semaphore. So we cannot join the migration thread because it could be
>> stuck waiting for the semaphore, which means we cannot join+cleanup the
>> channel thread because the semaphore is still being used.
>
> I think this is the major part of confusion, on why this can happen.
>
> The problem is afaik multifd_save_cleanup() is only called by
> migrate_fd_cleanup(), which is further only called in:
>
>   1) migrate_fd_cleanup_bh()
>   2) migrate_fd_connect()
>
> For 1): it's only run when migration comletes/fails/etc. (in all cases,
> right before it quits..) and then kicks off migrate_fd_cleanup_schedule().
> So migration thread shouldn't be stuck, afaiu, or it won't be able to kick
> that BH.
>
> For 2): it's called by the main thread, where migration thread should have
> not yet been created.
>
> With that, I don't see how migrate_fd_cleanup() would need to worry about
> migration thread
>
> Did I miss something?

There are two points:

1) multifd_new_send_channel_async() doesn't set an Error. Even if
multifd_channel_connect() fails, we'll still continue with
migrate_fd_connect(). I don't see any code that looks at the migration
error (s->error).

2) the TLS handshake thread of one of the channels could simply not get
any chance to run until something else fails and we reach
multifd_save_cleanup() from the BH path.

This second point in particular is why I don't think simply joining the
TLS thread will avoid the race. There's nothing linking the multifd
channels together; we could have 7 of them operational and an 8th one
still going through the TLS handshake.

That said, I'm not sure about the exact path we take to reach the bug
situation. It's very hard to reproduce so I'm relying entirely on code
inspection.
Peter Xu Nov. 14, 2023, 5:28 p.m. UTC | #6
On Mon, Nov 13, 2023 at 10:50:39PM -0300, Fabiano Rosas wrote:
> Peter Xu <peterx@redhat.com> writes:
> 
> > On Fri, Nov 10, 2023 at 09:05:41AM -0300, Fabiano Rosas wrote:
> >
> > [...]
> >
> >> > Then assuming we have a clear model with all these threads issue fixed (no
> >> > matter whether we'd shrink 2N threads into N threads), then what we need to
> >> > do, IMHO, is making sure to join() all of them before destroying anything
> >> > (say, per-channel MultiFDSendParams).  Then when we destroy everything
> >> > safely, either mutex/sem/etc..  Because no one will race us anymore.
> >> 
> >> This doesn't address the race. There's a data dependency between the
> >> multifd channels and the migration thread around the channels_ready
> >> semaphore. So we cannot join the migration thread because it could be
> >> stuck waiting for the semaphore, which means we cannot join+cleanup the
> >> channel thread because the semaphore is still being used.
> >
> > I think this is the major part of confusion, on why this can happen.
> >
> > The problem is afaik multifd_save_cleanup() is only called by
> > migrate_fd_cleanup(), which is further only called in:
> >
> >   1) migrate_fd_cleanup_bh()
> >   2) migrate_fd_connect()
> >
> > For 1): it's only run when migration comletes/fails/etc. (in all cases,
> > right before it quits..) and then kicks off migrate_fd_cleanup_schedule().
> > So migration thread shouldn't be stuck, afaiu, or it won't be able to kick
> > that BH.
> >
> > For 2): it's called by the main thread, where migration thread should have
> > not yet been created.
> >
> > With that, I don't see how migrate_fd_cleanup() would need to worry about
> > migration thread
> >
> > Did I miss something?
> 
> There are two points:
> 
> 1) multifd_new_send_channel_async() doesn't set an Error. Even if
> multifd_channel_connect() fails, we'll still continue with
> migrate_fd_connect(). I don't see any code that looks at the migration
> error (s->error).
> 
> 2) the TLS handshake thread of one of the channels could simply not get
> any chance to run until something else fails and we reach
> multifd_save_cleanup() from the BH path.
> 
> This second point in particular is why I don't think simply joining the
> TLS thread will avoid the race. There's nothing linking the multifd
> channels together, we could have 7 of them operational and a 8th one
> still going through the TLS handshake.
> 
> That said, I'm not sure about the exact path we take to reach the bug
> situation. It's very hard to reproduce so I'm relying entirely on code
> inspection.

This whole patch, iiuc, was trying to move the sem posts, which will only
kick the migration thread from different places.  IMHO, the problem is that
if multifd_save_cleanup() is only called in either migrate_fd_connect() or
the BH as discussed above, then it means the migration thread is already
gone!  I don't see how moving that sem_post() would help in any form.  It
means that the thread getting stuck, as you described, can still happen.

You're right; I think the thread is just fully out of migration's control.
Namely, anything we created with socket_send_channel_create().
Fundamentally, I think it's because qio_task_run_in_thread() doesn't
support that control, as it already creates its thread as DETACHED, leaving
the rest to luck.

Can we provide our own threads for that, or at least keep the thread ID
around, so that on the cleanup paths we can shutdown() the IOChannels and
join(), assuming that the shutdown() will kick the thread out of any
blocked IO?
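
As a plain-socket illustration of that assumption (not QEMU code; the
in-tree equivalent would go through qio_channel_shutdown()): shutting the
socket down forces the blocked read() to return, so the subsequent join
cannot hang.

#include <pthread.h>
#include <stdio.h>
#include <sys/socket.h>
#include <unistd.h>

static void *io_thread(void *opaque)
{
    int fd = *(int *)opaque;
    char buf[64];

    /* Blocks until data arrives or the socket is shut down. */
    while (read(fd, buf, sizeof(buf)) > 0) {
        /* consume data */
    }
    return NULL;
}

int main(void)
{
    int fds[2];
    pthread_t t;

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, fds) < 0) {
        return 1;
    }
    pthread_create(&t, NULL, io_thread, &fds[0]);

    /* shutdown() makes the blocked read() return 0, so join() cannot hang. */
    shutdown(fds[0], SHUT_RDWR);
    pthread_join(t, NULL);

    close(fds[0]);
    close(fds[1]);
    printf("io thread joined\n");
    return 0;
}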
Juan Quintela Nov. 16, 2023, 2:56 p.m. UTC | #7
Peter Xu <peterx@redhat.com> wrote:
> On Thu, Nov 09, 2023 at 01:58:56PM -0300, Fabiano Rosas wrote:
>> We cannot operate on the multifd semaphores outside of the multifd
>> channel thread
>> because multifd_save_cleanup() can run in parallel and
>> attempt to destroy the mutexes, which causes an assert.
>> 
>> Looking at the places where we use the semaphores aside from the
>> migration thread, there's only the TLS handshake thread and the
>> initial multifd_channel_connect() in the main thread. These are places
>> where creating the multifd channel cannot fail, so releasing the
>> semaphores at these places only serves the purpose of avoiding a
>> deadlock when an error happens before the channel(s) have started, but
>> after the migration thread has already called
>> multifd_send_sync_main().
>> 
>> Instead of attempting to release the semaphores at those places, move
>> the release into multifd_save_cleanup(). This puts the semaphore usage
>> in the same thread that does the cleanup, eliminating the race.
>> 
>> Move the call to multifd_save_cleanup() before joining for the
>> migration thread so we release the semaphores before.
>> 
>> Signed-off-by: Fabiano Rosas <farosas@suse.de>
>> ---
>>  migration/migration.c |  4 +++-
>>  migration/multifd.c   | 29 +++++++++++------------------
>>  2 files changed, 14 insertions(+), 19 deletions(-)
>> 
>> diff --git a/migration/migration.c b/migration/migration.c
>> index cca32c553c..52be20561b 100644
>> --- a/migration/migration.c
>> +++ b/migration/migration.c
>> @@ -1300,6 +1300,9 @@ static void migrate_fd_cleanup(MigrationState *s)
>>          QEMUFile *tmp;
>>  
>>          trace_migrate_fd_cleanup();
>> +
>> +        multifd_save_cleanup();
>> +
>>          qemu_mutex_unlock_iothread();
>>          if (s->migration_thread_running) {
>>              qemu_thread_join(&s->thread);
>> @@ -1307,7 +1310,6 @@ static void migrate_fd_cleanup(MigrationState *s)
>>          }
>>          qemu_mutex_lock_iothread();
>>  
>> -        multifd_save_cleanup();
>>          qemu_mutex_lock(&s->qemu_file_lock);
>>          tmp = s->to_dst_file;
>>          s->to_dst_file = NULL;
>> diff --git a/migration/multifd.c b/migration/multifd.c
>> index ec58c58082..9ca482cfbe 100644
>> --- a/migration/multifd.c
>> +++ b/migration/multifd.c
>> @@ -527,6 +527,9 @@ void multifd_save_cleanup(void)
>>  
>>          if (p->running) {
>>              qemu_thread_join(&p->thread);
>> +        } else {
>> +            qemu_sem_post(&p->sem_sync);
>> +            qemu_sem_post(&multifd_send_state->channels_ready);
>
> I think relying on p->running to join the thread is already problematic.

Why?

> Now all threads are created with JOINABLE, so we must join them to release
> the thread resources.  Clearing "running" at the end of the thread is
> already wrong to me, because it means if the thread quits before the main
> thread reaching here, we will not join the thread, and the thread resource
> will be leaked.

The bug is that the thread quits from some other place.
It is no different from forgetting to do a mutex_unlock().  It is an
error that needs fixing.

> Here IMHO we should set p->running=true right before thread created,

(that is basically what is done)

The meaning of ->running is that the multifd thread has started running; if
it is false, the thread is not running normally anymore.

thread function:

> and never clear it.

static void *multifd_send_thread(void *opaque)
{
    // No error can happen here

    while (true) {
    // No return command here, just breaks
    }

out:
    // This can fail

    qemu_mutex_lock(&p->mutex);
    p->running = false;
    qemu_mutex_unlock(&p->mutex);

    /* this can't fail */

    return NULL;
}


What running here means is that we don't need to "stop" this thread
anymore.  That happens as soon as we get out of the loop.

>  We may even want to rename it to p->thread_created?

We can rename it, but I am not sure it buys us much.  Notice that
it also means that we have created the channel.

>
>>          }
>>      }
>>      for (i = 0; i < migrate_multifd_channels(); i++) {
>> @@ -751,8 +754,6 @@ out:
>>          assert(local_err);
>>          trace_multifd_send_error(p->id);
>>          multifd_send_terminate_threads(local_err);
>> -        qemu_sem_post(&p->sem_sync);
>> -        qemu_sem_post(&multifd_send_state->channels_ready);
>>          error_free(local_err);
>>      }
>>  
>> @@ -780,20 +781,15 @@ static void multifd_tls_outgoing_handshake(QIOTask *task,
>>  
>>      if (!qio_task_propagate_error(task, &err)) {
>>          trace_multifd_tls_outgoing_handshake_complete(ioc);
>> -        if (multifd_channel_connect(p, ioc, &err)) {
>> -            return;
>> -        }
>> +        multifd_channel_connect(p, ioc, NULL);
>
> Ignoring Error** is not good..
>
> I think you meant to say "it should never fail", then we should put
> &error_abort.  Another cleaner way to do this is split the current
> multifd_channel_connect() into tls and non-tls helpers, then we can call
> the non-tls helpers here (which may not need an Error**).

This code is really weird because it is (very) asynchronous.
And TLS is even worse, because we need to do the equivalent of two
listen() calls.
Sniff.

>> +    } else {
>> +        /*
>> +         * The multifd client could already be waiting to queue data,
>> +         * so let it know that we didn't even start.
>> +         */
>> +        p->quit = true;
>> +        trace_multifd_tls_outgoing_handshake_error(ioc, error_get_pretty(err));
>>      }
>> -
>> -    trace_multifd_tls_outgoing_handshake_error(ioc, error_get_pretty(err));
>> -
>> -    /*
>> -     * Error happen, mark multifd_send_thread status as 'quit' although it
>> -     * is not created, and then tell who pay attention to me.
>> -     */
>> -    p->quit = true;
>> -    qemu_sem_post(&multifd_send_state->channels_ready);
>> -    qemu_sem_post(&p->sem_sync);
>>  }
>>  
>>  static void *multifd_tls_handshake_thread(void *opaque)
>> @@ -862,9 +858,6 @@ static void multifd_new_send_channel_cleanup(MultiFDSendParams *p,
>>                                               QIOChannel *ioc, Error *err)
>>  {
>>       migrate_set_error(migrate_get_current(), err);
>> -     /* Error happen, we need to tell who pay attention to me */
>> -     qemu_sem_post(&multifd_send_state->channels_ready);
>> -     qemu_sem_post(&p->sem_sync);
>>       /*
>>        * Although multifd_send_thread is not created, but main migration
>>        * thread need to judge whether it is running, so we need to mark
>> -- 
>
> I may still need some more time to digest your whole solution, currently
> not very clear to me.  It may or may not be the patch's problem, though.
>
> But let me also share how I see this..  I think we should rework the
> multifd thread model on channel establishment.
>
> Firstly, as I mentioned above, we must always join() the threads if it's
> JOINABLE or the resource is leaked, afaict.  That's the first thing to fix.

The problem that running is trying to fix is this:
* we get an additional error while we are already exiting for some reason.

So we end up calling multifd_fd_cleanup() several times, and things get
really weird.

So what we really need is to be sure that:
- we join each thread once and only once

	https://linux.die.net/man/3/pthread_join

	Joining with a thread that has previously been joined results in undefined behavior.


> Then let's see when TLS is there and what we do: we'll create "two" threads
> just for that, what's even worse:
>
>   - We'll create tls handshake thread in multifd_tls_channel_connect()
>     first, setting &p->thread.
>
>   - Within the same thread, we do multifd_tls_outgoing_handshake() when
>     handshake done -> multifd_channel_connect() -> we yet create the real
>     multifd_send_thread(), setting &p->thread too.
>
> So AFAICT, the tls handshake thread is already leaked, got p->thread
> overwritten by the new thread created by itself..
>
> I think we should fix this then.  I haven't figured the best way to do,
> two things I can come up with now:
>
>   1) At least make the tls handshake thread detached.
>
>   2) Make the handshake done in multifd send thread directly; I don't yet
>      see why we must create two threads..

That is the way we listen() on sockets asynchronously.  QIO channels are a
mystery to me, and this was really weird.

Using the glib main loop has lots of advantages.  Understanding how to
listen asynchronously is not one of them.

> Then assuming we have a clear model with all these threads issue fixed (no
> matter whether we'd shrink 2N threads into N threads), then what we need to
> do, IMHO, is making sure to join() all of them before destroying anything
> (say, per-channel MultiFDSendParams).  Then when we destroy everything
> safely, either mutex/sem/etc..  Because no one will race us anymore.

> Would it make sense?

I can't see why we need to join the threads created for TLS.  Those
could be made non-joinable and we could just call it a day.

Later, Juan.
Juan Quintela Nov. 16, 2023, 3:44 p.m. UTC | #8
Fabiano Rosas <farosas@suse.de> wrote:
> Peter Xu <peterx@redhat.com> writes:
>
>> On Thu, Nov 09, 2023 at 01:58:56PM -0300, Fabiano Rosas wrote:
>>> We cannot operate on the multifd semaphores outside of the multifd
>>> channel thread
>>> because multifd_save_cleanup() can run in parallel and
>>> attempt to destroy the mutexes, which causes an assert.
>>> 
>>> Looking at the places where we use the semaphores aside from the
>>> migration thread, there's only the TLS handshake thread and the
>>> initial multifd_channel_connect() in the main thread. These are places
>>> where creating the multifd channel cannot fail, so releasing the
>>> semaphores at these places only serves the purpose of avoiding a
>>> deadlock when an error happens before the channel(s) have started, but
>>> after the migration thread has already called
>>> multifd_send_sync_main().
>>> 
>>> Instead of attempting to release the semaphores at those places, move
>>> the release into multifd_save_cleanup(). This puts the semaphore usage
>>> in the same thread that does the cleanup, eliminating the race.
>>> 
>>> Move the call to multifd_save_cleanup() before joining for the
>>> migration thread so we release the semaphores before.
>>> 
>>> Signed-off-by: Fabiano Rosas <farosas@suse.de>
>>> ---
>>>  migration/migration.c |  4 +++-
>>>  migration/multifd.c   | 29 +++++++++++------------------
>>>  2 files changed, 14 insertions(+), 19 deletions(-)
>>> 
>>> diff --git a/migration/migration.c b/migration/migration.c
>>> index cca32c553c..52be20561b 100644
>>> --- a/migration/migration.c
>>> +++ b/migration/migration.c
>>> @@ -1300,6 +1300,9 @@ static void migrate_fd_cleanup(MigrationState *s)
>>>          QEMUFile *tmp;
>>>  
>>>          trace_migrate_fd_cleanup();
>>> +
>>> +        multifd_save_cleanup();
>>> +
>>>          qemu_mutex_unlock_iothread();
>>>          if (s->migration_thread_running) {
>>>              qemu_thread_join(&s->thread);
>>> @@ -1307,7 +1310,6 @@ static void migrate_fd_cleanup(MigrationState *s)
>>>          }
>>>          qemu_mutex_lock_iothread();
>>>  
>>> -        multifd_save_cleanup();
>>>          qemu_mutex_lock(&s->qemu_file_lock);
>>>          tmp = s->to_dst_file;
>>>          s->to_dst_file = NULL;
>>> diff --git a/migration/multifd.c b/migration/multifd.c
>>> index ec58c58082..9ca482cfbe 100644
>>> --- a/migration/multifd.c
>>> +++ b/migration/multifd.c
>>> @@ -527,6 +527,9 @@ void multifd_save_cleanup(void)
>>>  
>>>          if (p->running) {
>>>              qemu_thread_join(&p->thread);
>>> +        } else {
>>> +            qemu_sem_post(&p->sem_sync);
>>> +            qemu_sem_post(&multifd_send_state->channels_ready);
>>
>> I think relying on p->running to join the thread is already problematic.
>>
>
> Absolutely. I've already complained about this in another thread.

I haven't seen that yet.
But if there is a bug, of course we need to fix it.

>> Now all threads are created with JOINABLE, so we must join them to release
>> the thread resources.  Clearing "running" at the end of the thread is
>> already wrong to me, because it means if the thread quits before the main
>> thread reaching here, we will not join the thread, and the thread resource
>> will be leaked.
>>
>> Here IMHO we should set p->running=true right before thread created, and
>> never clear it.  We may even want to rename it to p->thread_created?
>>
>
> Ok, let me do that. I already have some patches touching
> multifd_new_send_channel_async() due to fixed-ram.

I still think this is a bad idea.  What that value means is whether we have
to send a post/whatever or not.
What I could agree with is that when we set ->running to false, we could
also shut down the QIO channel.

>>>          }
>>>      }
>>>      for (i = 0; i < migrate_multifd_channels(); i++) {
>>> @@ -751,8 +754,6 @@ out:
>>>          assert(local_err);
>>>          trace_multifd_send_error(p->id);
>>>          multifd_send_terminate_threads(local_err);
>>> -        qemu_sem_post(&p->sem_sync);
>>> -        qemu_sem_post(&multifd_send_state->channels_ready);
>>>          error_free(local_err);
>>>      }
>>>  
>>> @@ -780,20 +781,15 @@ static void multifd_tls_outgoing_handshake(QIOTask *task,
>>>  
>>>      if (!qio_task_propagate_error(task, &err)) {
>>>          trace_multifd_tls_outgoing_handshake_complete(ioc);
>>> -        if (multifd_channel_connect(p, ioc, &err)) {
>>> -            return;
>>> -        }
>>> +        multifd_channel_connect(p, ioc, NULL);
>>
>> Ignoring Error** is not good..
>>
>> I think you meant to say "it should never fail", then we should put
>> &error_abort.  Another cleaner way to do this is split the current
>> multifd_channel_connect() into tls and non-tls helpers, then we can call
>> the non-tls helpers here (which may not need an Error**).
>
> I attempted the split and it looked awkward, so I let go. But I agree
> about the Error.

As said, that code is really weird because tcp+tls is basically two
listen() calls.  I never understood that part of the code; I think it was
Daniel who wrote it.


>> I may still need some more time to digest your whole solution, currently
>> not very clear to me.  It may or may not be the patch's problem,
>> though.
>
> The core assumption of this patch is that we currently only need to
> release the semaphores because the multifd_send_sync_main() is allowed
> to execute even before the multifd channels have started. If creation
> failed to even start, for instance because of a TLS handshake failure,
> the migration thread will hang waiting for channels_ready (or sem_sync).
>
> Then there are two premises:
>
>  - when an error happens, multifd_save_cleanup() is always called;

That should happen.  The problem is that nothing guarantees that we
only have one error.

Or that, while we are sending, something tests migration_has_error() and
decides to also stop on the error path.

This needs to be audited.  I agree that we can do better here, but it is
not trivial.  The full change is 9.0 material.

>  - a hanging migration thread (at multifd_send_sync_main) will not stop
>    the main thread from reaching multifd_save_cleanup.

I don't remember, to be fair.  But I agree that we shouldn't wait.

> If this holds, then it's safe to release the semaphores right before
> destroying them at multifd_save_cleanup.

>> But let me also share how I see this..  I think we should rework the
>> multifd thread model on channel establishment.
>>
>> Firstly, as I mentioned above, we must always join() the threads if it's
>> JOINABLE or the resource is leaked, afaict.  That's the first thing to fix.
>
> I think we historically stumbled upon the fact that qemu_thread_join()
> is not the same as pthread_join(). The former takes a pointer and is not
> safe to call with a NULL QemuThread. That seems to be the reason for the
> p->running check before it.

As I pointed out in the other thread, joining a thread twice is undefined
behaviour, so we need to be careful here.
We return NULL, so it is not clear to me that we need to do anything
else there.

We could return an Error, but the problem here is that we want to tell
the user what the first error is, not any other error.

>> Then let's see when TLS is there and what we do: we'll create "two" threads
>> just for that, what's even worse:
>>
>>   - We'll create tls handshake thread in multifd_tls_channel_connect()
>>     first, setting &p->thread.
>>
>>   - Within the same thread, we do multifd_tls_outgoing_handshake() when
>>     handshake done -> multifd_channel_connect() -> we yet create the real
>>     multifd_send_thread(), setting &p->thread too.
>
> Hmm good point, we're calling qio_task_complete() from within the
> thread, that's misleading. I believe using qio_task_run_in_thread()
> would put the completion callback in the main thread, right? That would
> look a bit better I think because we could then join the handshake
> thread before multifd_channel_connect().

>> So AFAICT, the tls handshake thread is already leaked, got p->thread
>> overwritten by the new thread created by itself..
>>
>> I think we should fix this then.  I haven't figured the best way to do,
>> two things I can come up with now:
>>
>>   1) At least make the tls handshake thread detached.
>>
>>   2) Make the handshake done in multifd send thread directly; I don't yet
>>      see why we must create two threads..
>
> I'm (a little bit) against this. It's nice to know that a multifdsend_#
> thread will only be doing IO and no other task. I have the same concern
> about putting zero page detection in the multifd thread.
>
>> Then assuming we have a clear model with all these threads issue fixed (no
>> matter whether we'd shrink 2N threads into N threads), then what we need to
>> do, IMHO, is making sure to join() all of them before destroying anything
>> (say, per-channel MultiFDSendParams).  Then when we destroy everything
>> safely, either mutex/sem/etc..  Because no one will race us anymore.
>
> This doesn't address the race. There's a data dependency between the
> multifd channels and the migration thread around the channels_ready
> semaphore. So we cannot join the migration thread because it could be
> stuck waiting for the semaphore, which means we cannot join+cleanup the
> channel thread because the semaphore is still being used. That's why I'm
> moving the semaphores post to the point where we join the
> thread. Reading your email, now I think that should be:
>
> qemu_sem_post(&p->sem_sync);
> qemu_sem_post(&multifd_send_state->channels_ready);
> qemu_thread_join(&p->thread);
>
> Fundamentally, the issue is that we do create the multifd threads before
> the migration thread, but they might not be ready before the migration
> thread needs to use them.

Welcome to the async world.

But I agree that your idea could work.

Later, Juan.
Juan Quintela Nov. 16, 2023, 3:51 p.m. UTC | #9
Fabiano Rosas <farosas@suse.de> wrote:
> Fabiano Rosas <farosas@suse.de> writes:
>
>> Peter Xu <peterx@redhat.com> writes:
>>
>>> On Thu, Nov 09, 2023 at 01:58:56PM -0300, Fabiano Rosas wrote:

>> I think we historically stumbled upon the fact that qemu_thread_join()
>> is not the same as pthread_join(). The former takes a pointer and is not
>> safe to call with a NULL QemuThread. That seems to be the reason for the
>> p->running check before it.
>
> Scratch this part, the QemuThread is not a pointer.
>
> ...should it be? Because then we can test p->thread instead of
> p->running, which would be more precise and would dispense the
> thread_created flag.

You still need to make sure that you don't join the thread twice.
And we do the qemu_thread_join() without any lock.

Later, Juan.
Fabiano Rosas Nov. 16, 2023, 6:13 p.m. UTC | #10
Juan Quintela <quintela@redhat.com> writes:

> Peter Xu <peterx@redhat.com> wrote:
>> On Thu, Nov 09, 2023 at 01:58:56PM -0300, Fabiano Rosas wrote:
>>> We cannot operate on the multifd semaphores outside of the multifd
>>> channel thread
>>> because multifd_save_cleanup() can run in parallel and
>>> attempt to destroy the mutexes, which causes an assert.
>>> 
>>> Looking at the places where we use the semaphores aside from the
>>> migration thread, there's only the TLS handshake thread and the
>>> initial multifd_channel_connect() in the main thread. These are places
>>> where creating the multifd channel cannot fail, so releasing the
>>> semaphores at these places only serves the purpose of avoiding a
>>> deadlock when an error happens before the channel(s) have started, but
>>> after the migration thread has already called
>>> multifd_send_sync_main().
>>> 
>>> Instead of attempting to release the semaphores at those places, move
>>> the release into multifd_save_cleanup(). This puts the semaphore usage
>>> in the same thread that does the cleanup, eliminating the race.
>>> 
>>> Move the call to multifd_save_cleanup() before joining for the
>>> migration thread so we release the semaphores before.
>>> 
>>> Signed-off-by: Fabiano Rosas <farosas@suse.de>
>>> ---
>>>  migration/migration.c |  4 +++-
>>>  migration/multifd.c   | 29 +++++++++++------------------
>>>  2 files changed, 14 insertions(+), 19 deletions(-)
>>> 
>>> diff --git a/migration/migration.c b/migration/migration.c
>>> index cca32c553c..52be20561b 100644
>>> --- a/migration/migration.c
>>> +++ b/migration/migration.c
>>> @@ -1300,6 +1300,9 @@ static void migrate_fd_cleanup(MigrationState *s)
>>>          QEMUFile *tmp;
>>>  
>>>          trace_migrate_fd_cleanup();
>>> +
>>> +        multifd_save_cleanup();
>>> +
>>>          qemu_mutex_unlock_iothread();
>>>          if (s->migration_thread_running) {
>>>              qemu_thread_join(&s->thread);
>>> @@ -1307,7 +1310,6 @@ static void migrate_fd_cleanup(MigrationState *s)
>>>          }
>>>          qemu_mutex_lock_iothread();
>>>  
>>> -        multifd_save_cleanup();
>>>          qemu_mutex_lock(&s->qemu_file_lock);
>>>          tmp = s->to_dst_file;
>>>          s->to_dst_file = NULL;
>>> diff --git a/migration/multifd.c b/migration/multifd.c
>>> index ec58c58082..9ca482cfbe 100644
>>> --- a/migration/multifd.c
>>> +++ b/migration/multifd.c
>>> @@ -527,6 +527,9 @@ void multifd_save_cleanup(void)
>>>  
>>>          if (p->running) {
>>>              qemu_thread_join(&p->thread);
>>> +        } else {
>>> +            qemu_sem_post(&p->sem_sync);
>>> +            qemu_sem_post(&multifd_send_state->channels_ready);
>>
>> I think relying on p->running to join the thread is already problematic.
>
> Why?

The channel holds resources that need to be freed by
multifd_save_cleanup(). We cannot reliably do so while the channel
itself is the one responsible for telling the main thread whether it is
done with those resources. 

p->running works fine for knowing "has the thread been created", so we
can use it to avoid joining if it was never created. But it is bad for
knowing "is the thread still running" because as soon as the channel
sets p->running=false, multifd_save_cleanup() could attempt to destroy
the p->mutex and the semaphores.
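
Roughly, the window looks like this (sketch only, condensed from the TLS
handshake callback and the cleanup loop in this patch):

/* TLS handshake / channel code, still touching the channel state: */
qemu_sem_post(&p->sem_sync);
qemu_sem_post(&multifd_send_state->channels_ready);

/* main thread, in multifd_save_cleanup(), after seeing p->running == false: */
qemu_sem_destroy(&p->sem_sync);
qemu_mutex_destroy(&p->mutex);  /* asserts if someone is still using it */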

The bug reported in the other series was about this already:

[PATCH] migrate/multifd: fix coredump when the multifd thread cleanup
https://lore.kernel.org/r/20230621081826.3203053-1-zhangjianguo18@huawei.com

>> Now all threads are created with JOINABLE, so we must join them to release
>> the thread resources.  Clearing "running" at the end of the thread is
>> already wrong to me, because it means if the thread quits before the main
>> thread reaches here, we will not join the thread, and the thread resource
>> will be leaked.
>
> The bug is that the thread quits from some other place.
> It is no different from forgetting to do a mutex_unlock().  It is an
> error that needs fixing.
>
>> Here IMHO we should set p->running=true right before the thread is created,
>
> (that is basically what is done)
>
> The meaning of ->running is that the multifd thread has started running.
> If it is false, the thread is not running normally anymore.

The issue is that multifd_save_cleanup() sees p->running == false and
continues with the cleanup that destroys the mutex while the TLS thread
might still be using it.

>
> thread function:
>
>> and never clear it.
>
> static void *multifd_send_thread(void *opaque)
> {
>     // No errors can happen here
>
>     while (true) {
>     // No return command here, just breaks
>     }
>
> out:
>     // This can fail
>
>     qemu_mutex_lock(&p->mutex);
>     p->running = false;
>     qemu_mutex_unlock(&p->mutex);
>
>     /* this can't fail */
>
>     return NULL;
> }
>
>
> What running here means is that we don't need to "stop" this thread
> anymore.  That happens as soon as we get out of the loop.
>
>>  We may even want to rename it to p->thread_created?
>
> We can rename it, but I am not sure it buys us much.  Notice that
> we also mean that we have created the channel.
>
>>
>>>          }
>>>      }
>>>      for (i = 0; i < migrate_multifd_channels(); i++) {
>>> @@ -751,8 +754,6 @@ out:
>>>          assert(local_err);
>>>          trace_multifd_send_error(p->id);
>>>          multifd_send_terminate_threads(local_err);
>>> -        qemu_sem_post(&p->sem_sync);
>>> -        qemu_sem_post(&multifd_send_state->channels_ready);
>>>          error_free(local_err);
>>>      }
>>>  
>>> @@ -780,20 +781,15 @@ static void multifd_tls_outgoing_handshake(QIOTask *task,
>>>  
>>>      if (!qio_task_propagate_error(task, &err)) {
>>>          trace_multifd_tls_outgoing_handshake_complete(ioc);
>>> -        if (multifd_channel_connect(p, ioc, &err)) {
>>> -            return;
>>> -        }
>>> +        multifd_channel_connect(p, ioc, NULL);
>>
>> Ignoring Error** is not good..
>>
>> I think you meant to say "it should never fail", then we should put
>> &error_abort.  Another cleaner way to do this is split the current
>> multifd_channel_connect() into tls and non-tls helpers, then we can call
>> the non-tls helpers here (which may not need an Error**).
>
> This code is really weird because it is (very) asynchronous.
> And TLS is even worse, because we need to do the equivalent of two
> listen().
> Sniff.
>
>>> +    } else {
>>> +        /*
>>> +         * The multifd client could already be waiting to queue data,
>>> +         * so let it know that we didn't even start.
>>> +         */
>>> +        p->quit = true;
>>> +        trace_multifd_tls_outgoing_handshake_error(ioc, error_get_pretty(err));
>>>      }
>>> -
>>> -    trace_multifd_tls_outgoing_handshake_error(ioc, error_get_pretty(err));
>>> -
>>> -    /*
>>> -     * Error happen, mark multifd_send_thread status as 'quit' although it
>>> -     * is not created, and then tell who pay attention to me.
>>> -     */
>>> -    p->quit = true;
>>> -    qemu_sem_post(&multifd_send_state->channels_ready);
>>> -    qemu_sem_post(&p->sem_sync);
>>>  }
>>>  
>>>  static void *multifd_tls_handshake_thread(void *opaque)
>>> @@ -862,9 +858,6 @@ static void multifd_new_send_channel_cleanup(MultiFDSendParams *p,
>>>                                               QIOChannel *ioc, Error *err)
>>>  {
>>>       migrate_set_error(migrate_get_current(), err);
>>> -     /* Error happen, we need to tell who pay attention to me */
>>> -     qemu_sem_post(&multifd_send_state->channels_ready);
>>> -     qemu_sem_post(&p->sem_sync);
>>>       /*
>>>        * Although multifd_send_thread is not created, but main migration
>>>        * thread need to judge whether it is running, so we need to mark
>>> -- 
>>
>> I may still need some more time to digest your whole solution, currently
>> not very clear to me.  It may or may not be the patch's problem, though.
>>
>> But let me also share how I see this..  I think we should rework the
>> multifd thread model on channel establishment.
>>
>> Firstly, as I mentioned above, we must always join() the threads if it's
>> JOINABLE or the resource is leaked, afaict.  That's the first thing to fix.
>
> The problem that running is trying to fix is this:
> * we get an additional error while we are exiting for any reason.
>
> So we end up calling multifd_fd_cleanup() several times, and things get
> really weird.
>
> So what we really need is to be sure that:
> - we join each thread once and only once
>
> 	https://linux.die.net/man/3/pthread_join
>
> 	Joining with a thread that has previously been joined results in undefined behavior.
>

I don't see how we would call multifd_save_cleanup() more than once. The
qemu_mutex_destroy() calls in the loop below would already assert if
called twice.

>> Then let's see when TLS is there and what we do: we'll create "two" threads
>> just for that, what's even worse:
>>
>>   - We'll create tls handshake thread in multifd_tls_channel_connect()
>>     first, setting &p->thread.
>>
>>   - Within the same thread, we do multifd_tls_outgoing_handshake() when
>>     handshake done -> multifd_channel_connect() -> we yet create the real
>>     multifd_send_thread(), setting &p->thread too.
>>
>> So AFAICT, the tls handshake thread is already leaked, got p->thread
>> overwritten by the new thread created by itself..
>>
>> I think we should fix this then.  I haven't figured out the best way to do it;
>> two things I can come up with now:
>>
>>   1) At least make the tls handshake thread detached.
>>
>>   2) Make the handshake done in multifd send thread directly; I don't yet
>>      see why we must create two threads..
>
> That is how we listen() on sockets asynchronously. QIO channels are a
> mystery to me, and this was really weird.
>
> Using glib_main_loop() has lots of advantages.  Understanding how to
> listen asynchronously is not one of them.
>
>> Then assuming we have a clear model with all these thread issues fixed (no
>> matter whether we'd shrink 2N threads into N threads), then what we need to
>> do, IMHO, is making sure to join() all of them before destroying anything
>> (say, per-channel MultiFDSendParams).  Then we can destroy everything
>> safely, mutex/sem/etc., because no one will race us anymore.
>
>> Would it make sense?
>
> I can't see why we need to join the threads created for TLS.  Those
> could be made non-joinable and we could call it a day.

From the perspective of this series, we just need to be sure
multifd_tls_outgoing_handshake() never runs in parallel with
multifd_save_cleanup(). Joining the thread would be one way to do that.
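
For reference, the "detached handshake thread" option mentioned above would
be a one-line change at thread creation (sketch; thread name shown only as
an example, entry point as in the existing multifd_tls_handshake_thread()):

qemu_thread_create(&p->thread, "multifd-tls-handshake-worker",
                   multifd_tls_handshake_thread, p,
                   QEMU_THREAD_DETACHED);

That removes the join/leak concern for the handshake worker, but it does
not by itself stop the handshake callback from racing with
multifd_save_cleanup(), which is the problem this series is about.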
diff mbox series

Patch

diff --git a/migration/migration.c b/migration/migration.c
index cca32c553c..52be20561b 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -1300,6 +1300,9 @@  static void migrate_fd_cleanup(MigrationState *s)
         QEMUFile *tmp;
 
         trace_migrate_fd_cleanup();
+
+        multifd_save_cleanup();
+
         qemu_mutex_unlock_iothread();
         if (s->migration_thread_running) {
             qemu_thread_join(&s->thread);
@@ -1307,7 +1310,6 @@  static void migrate_fd_cleanup(MigrationState *s)
         }
         qemu_mutex_lock_iothread();
 
-        multifd_save_cleanup();
         qemu_mutex_lock(&s->qemu_file_lock);
         tmp = s->to_dst_file;
         s->to_dst_file = NULL;
diff --git a/migration/multifd.c b/migration/multifd.c
index ec58c58082..9ca482cfbe 100644
--- a/migration/multifd.c
+++ b/migration/multifd.c
@@ -527,6 +527,9 @@  void multifd_save_cleanup(void)
 
         if (p->running) {
             qemu_thread_join(&p->thread);
+        } else {
+            qemu_sem_post(&p->sem_sync);
+            qemu_sem_post(&multifd_send_state->channels_ready);
         }
     }
     for (i = 0; i < migrate_multifd_channels(); i++) {
@@ -751,8 +754,6 @@  out:
         assert(local_err);
         trace_multifd_send_error(p->id);
         multifd_send_terminate_threads(local_err);
-        qemu_sem_post(&p->sem_sync);
-        qemu_sem_post(&multifd_send_state->channels_ready);
         error_free(local_err);
     }
 
@@ -780,20 +781,15 @@  static void multifd_tls_outgoing_handshake(QIOTask *task,
 
     if (!qio_task_propagate_error(task, &err)) {
         trace_multifd_tls_outgoing_handshake_complete(ioc);
-        if (multifd_channel_connect(p, ioc, &err)) {
-            return;
-        }
+        multifd_channel_connect(p, ioc, NULL);
+    } else {
+        /*
+         * The multifd client could already be waiting to queue data,
+         * so let it know that we didn't even start.
+         */
+        p->quit = true;
+        trace_multifd_tls_outgoing_handshake_error(ioc, error_get_pretty(err));
     }
-
-    trace_multifd_tls_outgoing_handshake_error(ioc, error_get_pretty(err));
-
-    /*
-     * Error happen, mark multifd_send_thread status as 'quit' although it
-     * is not created, and then tell who pay attention to me.
-     */
-    p->quit = true;
-    qemu_sem_post(&multifd_send_state->channels_ready);
-    qemu_sem_post(&p->sem_sync);
 }
 
 static void *multifd_tls_handshake_thread(void *opaque)
@@ -862,9 +858,6 @@  static void multifd_new_send_channel_cleanup(MultiFDSendParams *p,
                                              QIOChannel *ioc, Error *err)
 {
      migrate_set_error(migrate_get_current(), err);
-     /* Error happen, we need to tell who pay attention to me */
-     qemu_sem_post(&multifd_send_state->channels_ready);
-     qemu_sem_post(&p->sem_sync);
      /*
       * Although multifd_send_thread is not created, but main migration
       * thread need to judge whether it is running, so we need to mark