Message ID | 20231109165856.15224-3-farosas@suse.de (mailing list archive)
---|---
State | New, archived
Series | migration: Fix multifd qemu_mutex_destroy race
On Thu, Nov 09, 2023 at 01:58:56PM -0300, Fabiano Rosas wrote: > We cannot operate on the multifd semaphores outside of the multifd > channel thread > because multifd_save_cleanup() can run in parallel and > attempt to destroy the mutexes, which causes an assert. > > Looking at the places where we use the semaphores aside from the > migration thread, there's only the TLS handshake thread and the > initial multifd_channel_connect() in the main thread. These are places > where creating the multifd channel cannot fail, so releasing the > semaphores at these places only serves the purpose of avoiding a > deadlock when an error happens before the channel(s) have started, but > after the migration thread has already called > multifd_send_sync_main(). > > Instead of attempting to release the semaphores at those places, move > the release into multifd_save_cleanup(). This puts the semaphore usage > in the same thread that does the cleanup, eliminating the race. > > Move the call to multifd_save_cleanup() before joining for the > migration thread so we release the semaphores before. 
> > Signed-off-by: Fabiano Rosas <farosas@suse.de> > --- > migration/migration.c | 4 +++- > migration/multifd.c | 29 +++++++++++------------------ > 2 files changed, 14 insertions(+), 19 deletions(-) > > diff --git a/migration/migration.c b/migration/migration.c > index cca32c553c..52be20561b 100644 > --- a/migration/migration.c > +++ b/migration/migration.c > @@ -1300,6 +1300,9 @@ static void migrate_fd_cleanup(MigrationState *s) > QEMUFile *tmp; > > trace_migrate_fd_cleanup(); > + > + multifd_save_cleanup(); > + > qemu_mutex_unlock_iothread(); > if (s->migration_thread_running) { > qemu_thread_join(&s->thread); > @@ -1307,7 +1310,6 @@ static void migrate_fd_cleanup(MigrationState *s) > } > qemu_mutex_lock_iothread(); > > - multifd_save_cleanup(); > qemu_mutex_lock(&s->qemu_file_lock); > tmp = s->to_dst_file; > s->to_dst_file = NULL; > diff --git a/migration/multifd.c b/migration/multifd.c > index ec58c58082..9ca482cfbe 100644 > --- a/migration/multifd.c > +++ b/migration/multifd.c > @@ -527,6 +527,9 @@ void multifd_save_cleanup(void) > > if (p->running) { > qemu_thread_join(&p->thread); > + } else { > + qemu_sem_post(&p->sem_sync); > + qemu_sem_post(&multifd_send_state->channels_ready); I think relying on p->running to join the thread is already problematic. Now all threads are created with JOINABLE, so we must join them to release the thread resources. Clearing "running" at the end of the thread is already wrong to me, because it means if the thread quits before the main thread reaching here, we will not join the thread, and the thread resource will be leaked. Here IMHO we should set p->running=true right before thread created, and never clear it. We may even want to rename it to p->thread_created? 
> } > } > for (i = 0; i < migrate_multifd_channels(); i++) { > @@ -751,8 +754,6 @@ out: > assert(local_err); > trace_multifd_send_error(p->id); > multifd_send_terminate_threads(local_err); > - qemu_sem_post(&p->sem_sync); > - qemu_sem_post(&multifd_send_state->channels_ready); > error_free(local_err); > } > > @@ -780,20 +781,15 @@ static void multifd_tls_outgoing_handshake(QIOTask *task, > > if (!qio_task_propagate_error(task, &err)) { > trace_multifd_tls_outgoing_handshake_complete(ioc); > - if (multifd_channel_connect(p, ioc, &err)) { > - return; > - } > + multifd_channel_connect(p, ioc, NULL); Ignoring Error** is not good.. I think you meant to say "it should never fail", then we should put &error_abort. Another cleaner way to do this is split the current multifd_channel_connect() into tls and non-tls helpers, then we can call the non-tls helpers here (which may not need an Error**). > + } else { > + /* > + * The multifd client could already be waiting to queue data, > + * so let it know that we didn't even start. > + */ > + p->quit = true; > + trace_multifd_tls_outgoing_handshake_error(ioc, error_get_pretty(err)); > } > - > - trace_multifd_tls_outgoing_handshake_error(ioc, error_get_pretty(err)); > - > - /* > - * Error happen, mark multifd_send_thread status as 'quit' although it > - * is not created, and then tell who pay attention to me. 
> - */ > - p->quit = true; > - qemu_sem_post(&multifd_send_state->channels_ready); > - qemu_sem_post(&p->sem_sync); > } > > static void *multifd_tls_handshake_thread(void *opaque) > @@ -862,9 +858,6 @@ static void multifd_new_send_channel_cleanup(MultiFDSendParams *p, > QIOChannel *ioc, Error *err) > { > migrate_set_error(migrate_get_current(), err); > - /* Error happen, we need to tell who pay attention to me */ > - qemu_sem_post(&multifd_send_state->channels_ready); > - qemu_sem_post(&p->sem_sync); > /* > * Although multifd_send_thread is not created, but main migration > * thread need to judge whether it is running, so we need to mark > -- I may still need some more time to digest your whole solution, currently not very clear to me. It may or may not be the patch's problem, though. But let me also share how I see this.. I think we should rework the multifd thread model on channel establishment. Firstly, as I mentioned above, we must always join() the threads if it's JOINABLE or the resource is leaked, afaict. That's the first thing to fix. Then let's see when TLS is there and what we do: we'll create "two" threads just for that, what's even worse: - We'll create tls handshake thread in multifd_tls_channel_connect() first, setting &p->thread. - Within the same thread, we do multifd_tls_outgoing_handshake() when handshake done -> multifd_channel_connect() -> we yet create the real multifd_send_thread(), setting &p->thread too. So AFAICT, the tls handshake thread is already leaked, got p->thread overwritten by the new thread created by itself.. I think we should fix this then. I haven't figured the best way to do, two things I can come up with now: 1) At least make the tls handshake thread detached. 2) Make the handshake done in multifd send thread directly; I don't yet see why we must create two threads.. 
Then assuming we have a clear model with all these threads issue fixed (no matter whether we'd shrink 2N threads into N threads), then what we need to do, IMHO, is making sure to join() all of them before destroying anything (say, per-channel MultiFDSendParams). Then when we destroy everything safely, either mutex/sem/etc.. Because no one will race us anymore. Would it make sense? Thanks,
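[Editor's note: a minimal pthreads sketch of the join-or-leak point above — not QEMU code; the `Channel` struct and function names are invented for illustration. The idea is Peter's suggestion: set a flag right before thread creation and let only the cleanup path clear it, so cleanup always knows a join is owed even if the thread body already returned.]

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct {
    pthread_t thread;
    bool thread_created;   /* set before pthread_create(), never cleared by the thread */
} Channel;

static void *worker(void *opaque)
{
    /* Returns immediately: if a "running" flag were cleared here instead,
     * cleanup could observe it as false, skip the join, and leak the
     * joinable thread's resources. */
    (void)opaque;
    return NULL;
}

void channel_start(Channel *c)
{
    c->thread_created = true;                  /* before creation, in the creating thread */
    pthread_create(&c->thread, NULL, worker, c);
}

/* Returns true if a thread was joined. */
bool channel_cleanup(Channel *c)
{
    if (!c->thread_created) {
        return false;                          /* nothing was ever created */
    }
    pthread_join(c->thread, NULL);             /* always reclaim a joinable thread */
    c->thread_created = false;                 /* only the cleanup path clears the flag */
    return true;
}
```

With this shape, cleanup is correct whether the worker is still running, already exited, or was never started at all.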
Peter Xu <peterx@redhat.com> writes: > On Thu, Nov 09, 2023 at 01:58:56PM -0300, Fabiano Rosas wrote: >> We cannot operate on the multifd semaphores outside of the multifd >> channel thread >> because multifd_save_cleanup() can run in parallel and >> attempt to destroy the mutexes, which causes an assert. >> >> Looking at the places where we use the semaphores aside from the >> migration thread, there's only the TLS handshake thread and the >> initial multifd_channel_connect() in the main thread. These are places >> where creating the multifd channel cannot fail, so releasing the >> semaphores at these places only serves the purpose of avoiding a >> deadlock when an error happens before the channel(s) have started, but >> after the migration thread has already called >> multifd_send_sync_main(). >> >> Instead of attempting to release the semaphores at those places, move >> the release into multifd_save_cleanup(). This puts the semaphore usage >> in the same thread that does the cleanup, eliminating the race. >> >> Move the call to multifd_save_cleanup() before joining for the >> migration thread so we release the semaphores before. 
>> >> Signed-off-by: Fabiano Rosas <farosas@suse.de> >> --- >> migration/migration.c | 4 +++- >> migration/multifd.c | 29 +++++++++++------------------ >> 2 files changed, 14 insertions(+), 19 deletions(-) >> >> diff --git a/migration/migration.c b/migration/migration.c >> index cca32c553c..52be20561b 100644 >> --- a/migration/migration.c >> +++ b/migration/migration.c >> @@ -1300,6 +1300,9 @@ static void migrate_fd_cleanup(MigrationState *s) >> QEMUFile *tmp; >> >> trace_migrate_fd_cleanup(); >> + >> + multifd_save_cleanup(); >> + >> qemu_mutex_unlock_iothread(); >> if (s->migration_thread_running) { >> qemu_thread_join(&s->thread); >> @@ -1307,7 +1310,6 @@ static void migrate_fd_cleanup(MigrationState *s) >> } >> qemu_mutex_lock_iothread(); >> >> - multifd_save_cleanup(); >> qemu_mutex_lock(&s->qemu_file_lock); >> tmp = s->to_dst_file; >> s->to_dst_file = NULL; >> diff --git a/migration/multifd.c b/migration/multifd.c >> index ec58c58082..9ca482cfbe 100644 >> --- a/migration/multifd.c >> +++ b/migration/multifd.c >> @@ -527,6 +527,9 @@ void multifd_save_cleanup(void) >> >> if (p->running) { >> qemu_thread_join(&p->thread); >> + } else { >> + qemu_sem_post(&p->sem_sync); >> + qemu_sem_post(&multifd_send_state->channels_ready); > > I think relying on p->running to join the thread is already problematic. > Absolutely. I've already complained about this in another thread. > Now all threads are created with JOINABLE, so we must join them to release > the thread resources. Clearing "running" at the end of the thread is > already wrong to me, because it means if the thread quits before the main > thread reaching here, we will not join the thread, and the thread resource > will be leaked. > > Here IMHO we should set p->running=true right before thread created, and > never clear it. We may even want to rename it to p->thread_created? > Ok, let me do that. I already have some patches touching multifd_new_send_channel_async() due to fixed-ram. 
>> } >> } >> for (i = 0; i < migrate_multifd_channels(); i++) { >> @@ -751,8 +754,6 @@ out: >> assert(local_err); >> trace_multifd_send_error(p->id); >> multifd_send_terminate_threads(local_err); >> - qemu_sem_post(&p->sem_sync); >> - qemu_sem_post(&multifd_send_state->channels_ready); >> error_free(local_err); >> } >> >> @@ -780,20 +781,15 @@ static void multifd_tls_outgoing_handshake(QIOTask *task, >> >> if (!qio_task_propagate_error(task, &err)) { >> trace_multifd_tls_outgoing_handshake_complete(ioc); >> - if (multifd_channel_connect(p, ioc, &err)) { >> - return; >> - } >> + multifd_channel_connect(p, ioc, NULL); > > Ignoring Error** is not good.. > > I think you meant to say "it should never fail", then we should put > &error_abort. Another cleaner way to do this is split the current > multifd_channel_connect() into tls and non-tls helpers, then we can call > the non-tls helpers here (which may not need an Error**). I attempted the split and it looked awkward, so I let go. But I agree about the Error. >> + } else { >> + /* >> + * The multifd client could already be waiting to queue data, >> + * so let it know that we didn't even start. >> + */ >> + p->quit = true; >> + trace_multifd_tls_outgoing_handshake_error(ioc, error_get_pretty(err)); >> } >> - >> - trace_multifd_tls_outgoing_handshake_error(ioc, error_get_pretty(err)); >> - >> - /* >> - * Error happen, mark multifd_send_thread status as 'quit' although it >> - * is not created, and then tell who pay attention to me. 
>> - */ >> - p->quit = true; >> - qemu_sem_post(&multifd_send_state->channels_ready); >> - qemu_sem_post(&p->sem_sync); >> } >> >> static void *multifd_tls_handshake_thread(void *opaque) >> @@ -862,9 +858,6 @@ static void multifd_new_send_channel_cleanup(MultiFDSendParams *p, >> QIOChannel *ioc, Error *err) >> { >> migrate_set_error(migrate_get_current(), err); >> - /* Error happen, we need to tell who pay attention to me */ >> - qemu_sem_post(&multifd_send_state->channels_ready); >> - qemu_sem_post(&p->sem_sync); >> /* >> * Although multifd_send_thread is not created, but main migration >> * thread need to judge whether it is running, so we need to mark >> -- > > I may still need some more time to digest your whole solution, currently > not very clear to me. It may or may not be the patch's problem, > though. The core assumption of this patch is that we currently only need to release the semaphores because the multifd_send_sync_main() is allowed to execute even before the multifd channels have started. If creation failed to even start, for instance because of a TLS handshake failure, the migration thread will hang waiting for channels_ready (or sem_sync). Then there are two premises: - when an error happens, multifd_save_cleanup() is always called; - a hanging migration thread (at multifd_send_sync_main) will not stop the main thread from reaching multifd_save_cleanup. If this holds, then it's safe to release the semaphores right before destroying them at multifd_save_cleanup. > But let me also share how I see this.. I think we should rework the > multifd thread model on channel establishment. > > Firstly, as I mentioned above, we must always join() the threads if it's > JOINABLE or the resource is leaked, afaict. That's the first thing to fix. I think we historically stumbled upon the fact that qemu_thread_join() is not the same as pthread_join(). The former takes a pointer and is not safe to call with a NULL QemuThread. 
That seems to be the reason for the p->running check before it. > Then let's see when TLS is there and what we do: we'll create "two" threads > just for that, what's even worse: > > - We'll create tls handshake thread in multifd_tls_channel_connect() > first, setting &p->thread. > > - Within the same thread, we do multifd_tls_outgoing_handshake() when > handshake done -> multifd_channel_connect() -> we yet create the real > multifd_send_thread(), setting &p->thread too. Hmm good point, we're calling qio_task_complete() from within the thread, that's misleading. I believe using qio_task_run_in_thread() would put the completion callback in the main thread, right? That would look a bit better I think because we could then join the handshake thread before multifd_channel_connect(). > So AFAICT, the tls handshake thread is already leaked, got p->thread > overwritten by the new thread created by itself.. > > I think we should fix this then. I haven't figured the best way to do, > two things I can come up with now: > > 1) At least make the tls handshake thread detached. > > 2) Make the handshake done in multifd send thread directly; I don't yet > see why we must create two threads.. I'm (a little bit) against this. It's nice to know that a multifdsend_# thread will only be doing IO and no other task. I have the same concern about putting zero page detection in the multifd thread. > Then assuming we have a clear model with all these threads issue fixed (no > matter whether we'd shrink 2N threads into N threads), then what we need to > do, IMHO, is making sure to join() all of them before destroying anything > (say, per-channel MultiFDSendParams). Then when we destroy everything > safely, either mutex/sem/etc.. Because no one will race us anymore. This doesn't address the race. There's a data dependency between the multifd channels and the migration thread around the channels_ready semaphore. 
So we cannot join the migration thread because it could be stuck waiting for the semaphore, which means we cannot join+cleanup the channel thread because the semaphore is still being used. That's why I'm moving the semaphores post to the point where we join the thread. Reading your email, now I think that should be: qemu_sem_post(&p->sem_sync); qemu_sem_post(&multifd_send_state->channels_ready); qemu_thread_join(&p->thread); Fundamentally, the issue is that we do create the multifd threads before the migration thread, but they might not be ready before the migration thread needs to use them.
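[Editor's note: the post-then-join ordering Fabiano arrives at can be sketched with plain POSIX primitives — not QEMU code; `Waiter` and the function names are invented, and `channels_ready` here only stands in for the real semaphore. Posting before joining guarantees a waiter blocked in `sem_wait()` can make progress, so the join cannot hang and the destroy afterwards cannot race anyone.]

```c
#include <pthread.h>
#include <semaphore.h>
#include <stddef.h>

typedef struct {
    pthread_t thread;
    sem_t channels_ready;   /* stand-in for multifd_send_state->channels_ready */
    int woke;
} Waiter;

static void *sync_thread(void *opaque)
{
    Waiter *w = opaque;
    sem_wait(&w->channels_ready);   /* blocks forever unless someone posts */
    w->woke = 1;
    return NULL;
}

/* Post first, then join, then destroy.  Returns 1 on success. */
int cleanup_in_order(void)
{
    Waiter w = { .woke = 0 };
    sem_init(&w.channels_ready, 0, 0);
    pthread_create(&w.thread, NULL, sync_thread, &w);

    sem_post(&w.channels_ready);      /* release a possibly-blocked waiter... */
    pthread_join(w.thread, NULL);     /* ...so this join cannot deadlock */
    sem_destroy(&w.channels_ready);   /* safe: the only user has been joined */
    return w.woke;
}
```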
Fabiano Rosas <farosas@suse.de> writes: > Peter Xu <peterx@redhat.com> writes: > >> On Thu, Nov 09, 2023 at 01:58:56PM -0300, Fabiano Rosas wrote: >>> We cannot operate on the multifd semaphores outside of the multifd >>> channel thread >>> because multifd_save_cleanup() can run in parallel and >>> attempt to destroy the mutexes, which causes an assert. >>> >>> Looking at the places where we use the semaphores aside from the >>> migration thread, there's only the TLS handshake thread and the >>> initial multifd_channel_connect() in the main thread. These are places >>> where creating the multifd channel cannot fail, so releasing the >>> semaphores at these places only serves the purpose of avoiding a >>> deadlock when an error happens before the channel(s) have started, but >>> after the migration thread has already called >>> multifd_send_sync_main(). >>> >>> Instead of attempting to release the semaphores at those places, move >>> the release into multifd_save_cleanup(). This puts the semaphore usage >>> in the same thread that does the cleanup, eliminating the race. >>> >>> Move the call to multifd_save_cleanup() before joining for the >>> migration thread so we release the semaphores before. 
>>> >>> Signed-off-by: Fabiano Rosas <farosas@suse.de> >>> --- >>> migration/migration.c | 4 +++- >>> migration/multifd.c | 29 +++++++++++------------------ >>> 2 files changed, 14 insertions(+), 19 deletions(-) >>> >>> diff --git a/migration/migration.c b/migration/migration.c >>> index cca32c553c..52be20561b 100644 >>> --- a/migration/migration.c >>> +++ b/migration/migration.c >>> @@ -1300,6 +1300,9 @@ static void migrate_fd_cleanup(MigrationState *s) >>> QEMUFile *tmp; >>> >>> trace_migrate_fd_cleanup(); >>> + >>> + multifd_save_cleanup(); >>> + >>> qemu_mutex_unlock_iothread(); >>> if (s->migration_thread_running) { >>> qemu_thread_join(&s->thread); >>> @@ -1307,7 +1310,6 @@ static void migrate_fd_cleanup(MigrationState *s) >>> } >>> qemu_mutex_lock_iothread(); >>> >>> - multifd_save_cleanup(); >>> qemu_mutex_lock(&s->qemu_file_lock); >>> tmp = s->to_dst_file; >>> s->to_dst_file = NULL; >>> diff --git a/migration/multifd.c b/migration/multifd.c >>> index ec58c58082..9ca482cfbe 100644 >>> --- a/migration/multifd.c >>> +++ b/migration/multifd.c >>> @@ -527,6 +527,9 @@ void multifd_save_cleanup(void) >>> >>> if (p->running) { >>> qemu_thread_join(&p->thread); >>> + } else { >>> + qemu_sem_post(&p->sem_sync); >>> + qemu_sem_post(&multifd_send_state->channels_ready); >> >> I think relying on p->running to join the thread is already problematic. >> > > Absolutely. I've already complained about this in another thread. > >> Now all threads are created with JOINABLE, so we must join them to release >> the thread resources. Clearing "running" at the end of the thread is >> already wrong to me, because it means if the thread quits before the main >> thread reaching here, we will not join the thread, and the thread resource >> will be leaked. >> >> Here IMHO we should set p->running=true right before thread created, and >> never clear it. We may even want to rename it to p->thread_created? >> > > Ok, let me do that. 
I already have some patches touching > multifd_new_send_channel_async() due to fixed-ram. > >>> } >>> } >>> for (i = 0; i < migrate_multifd_channels(); i++) { >>> @@ -751,8 +754,6 @@ out: >>> assert(local_err); >>> trace_multifd_send_error(p->id); >>> multifd_send_terminate_threads(local_err); >>> - qemu_sem_post(&p->sem_sync); >>> - qemu_sem_post(&multifd_send_state->channels_ready); >>> error_free(local_err); >>> } >>> >>> @@ -780,20 +781,15 @@ static void multifd_tls_outgoing_handshake(QIOTask *task, >>> >>> if (!qio_task_propagate_error(task, &err)) { >>> trace_multifd_tls_outgoing_handshake_complete(ioc); >>> - if (multifd_channel_connect(p, ioc, &err)) { >>> - return; >>> - } >>> + multifd_channel_connect(p, ioc, NULL); >> >> Ignoring Error** is not good.. >> >> I think you meant to say "it should never fail", then we should put >> &error_abort. Another cleaner way to do this is split the current >> multifd_channel_connect() into tls and non-tls helpers, then we can call >> the non-tls helpers here (which may not need an Error**). > > I attempted the split and it looked awkward, so I let go. But I agree > about the Error. > >>> + } else { >>> + /* >>> + * The multifd client could already be waiting to queue data, >>> + * so let it know that we didn't even start. >>> + */ >>> + p->quit = true; >>> + trace_multifd_tls_outgoing_handshake_error(ioc, error_get_pretty(err)); >>> } >>> - >>> - trace_multifd_tls_outgoing_handshake_error(ioc, error_get_pretty(err)); >>> - >>> - /* >>> - * Error happen, mark multifd_send_thread status as 'quit' although it >>> - * is not created, and then tell who pay attention to me. 
>>> - */ >>> - p->quit = true; >>> - qemu_sem_post(&multifd_send_state->channels_ready); >>> - qemu_sem_post(&p->sem_sync); >>> } >>> >>> static void *multifd_tls_handshake_thread(void *opaque) >>> @@ -862,9 +858,6 @@ static void multifd_new_send_channel_cleanup(MultiFDSendParams *p, >>> QIOChannel *ioc, Error *err) >>> { >>> migrate_set_error(migrate_get_current(), err); >>> - /* Error happen, we need to tell who pay attention to me */ >>> - qemu_sem_post(&multifd_send_state->channels_ready); >>> - qemu_sem_post(&p->sem_sync); >>> /* >>> * Although multifd_send_thread is not created, but main migration >>> * thread need to judge whether it is running, so we need to mark >>> -- >> >> I may still need some more time to digest your whole solution, currently >> not very clear to me. It may or may not be the patch's problem, >> though. > > The core assumption of this patch is that we currently only need to > release the semaphores because the multifd_send_sync_main() is allowed > to execute even before the multifd channels have started. If creation > failed to even start, for instance because of a TLS handshake failure, > the migration thread will hang waiting for channels_ready (or sem_sync). > > Then there are two premises: > > - when an error happens, multifd_save_cleanup() is always called; > > - a hanging migration thread (at multifd_send_sync_main) will not stop > the main thread from reaching multifd_save_cleanup. > > If this holds, then it's safe to release the semaphores right before > destroying them at multifd_save_cleanup. > >> But let me also share how I see this.. I think we should rework the >> multifd thread model on channel establishment. >> >> Firstly, as I mentioned above, we must always join() the threads if it's >> JOINABLE or the resource is leaked, afaict. That's the first thing to fix. > > I think we historically stumbled upon the fact that qemu_thread_join() > is not the same as pthread_join(). 
The former takes a pointer and is not > safe to call with a NULL QemuThread. That seems to be the reason for the > p->running check before it. Scratch this part, the QemuThread is not a pointer. ...should it be? Because then we can test p->thread instead of p->running, which would be more precise and would dispense the thread_created flag. >> Then let's see when TLS is there and what we do: we'll create "two" threads >> just for that, what's even worse: >> >> - We'll create tls handshake thread in multifd_tls_channel_connect() >> first, setting &p->thread. >> >> - Within the same thread, we do multifd_tls_outgoing_handshake() when >> handshake done -> multifd_channel_connect() -> we yet create the real >> multifd_send_thread(), setting &p->thread too. > > Hmm good point, we're calling qio_task_complete() from within the > thread, that's misleading. I believe using qio_task_run_in_thread() > would put the completion callback in the main thread, right? That would > look a bit better I think because we could then join the handshake > thread before multifd_channel_connect(). > >> So AFAICT, the tls handshake thread is already leaked, got p->thread >> overwritten by the new thread created by itself.. >> >> I think we should fix this then. I haven't figured the best way to do, >> two things I can come up with now: >> >> 1) At least make the tls handshake thread detached. >> >> 2) Make the handshake done in multifd send thread directly; I don't yet >> see why we must create two threads.. > > I'm (a little bit) against this. It's nice to know that a multifdsend_# > thread will only be doing IO and no other task. I have the same concern > about putting zero page detection in the multifd thread. > >> Then assuming we have a clear model with all these threads issue fixed (no >> matter whether we'd shrink 2N threads into N threads), then what we need to >> do, IMHO, is making sure to join() all of them before destroying anything >> (say, per-channel MultiFDSendParams). 
Then when we destroy everything >> safely, either mutex/sem/etc.. Because no one will race us anymore. > > This doesn't address the race. There's a data dependency between the > multifd channels and the migration thread around the channels_ready > semaphore. So we cannot join the migration thread because it could be > stuck waiting for the semaphore, which means we cannot join+cleanup the > channel thread because the semaphore is still being used. That's why I'm > moving the semaphores post to the point where we join the > thread. Reading your email, now I think that should be: > > qemu_sem_post(&p->sem_sync); > qemu_sem_post(&multifd_send_state->channels_ready); > qemu_thread_join(&p->thread); > > Fundamentally, the issue is that we do create the multifd threads before > the migration thread, but they might not be ready before the migration > thread needs to use them.
On Fri, Nov 10, 2023 at 09:05:41AM -0300, Fabiano Rosas wrote: [...] > > Then assuming we have a clear model with all these threads issue fixed (no > > matter whether we'd shrink 2N threads into N threads), then what we need to > > do, IMHO, is making sure to join() all of them before destroying anything > > (say, per-channel MultiFDSendParams). Then when we destroy everything > > safely, either mutex/sem/etc.. Because no one will race us anymore. > > This doesn't address the race. There's a data dependency between the > multifd channels and the migration thread around the channels_ready > semaphore. So we cannot join the migration thread because it could be > stuck waiting for the semaphore, which means we cannot join+cleanup the > channel thread because the semaphore is still being used. I think this is the major part of confusion, on why this can happen. The problem is afaik multifd_save_cleanup() is only called by migrate_fd_cleanup(), which is further only called in: 1) migrate_fd_cleanup_bh() 2) migrate_fd_connect() For 1): it's only run when migration completes/fails/etc. (in all cases, right before it quits..) and then kicks off migrate_fd_cleanup_schedule(). So migration thread shouldn't be stuck, afaiu, or it won't be able to kick that BH. For 2): it's called by the main thread, where migration thread should have not yet been created. With that, I don't see how migrate_fd_cleanup() would need to worry about the migration thread. Did I miss something? Thanks,
Peter Xu <peterx@redhat.com> writes: > On Fri, Nov 10, 2023 at 09:05:41AM -0300, Fabiano Rosas wrote: > > [...] > >> > Then assuming we have a clear model with all these threads issue fixed (no >> > matter whether we'd shrink 2N threads into N threads), then what we need to >> > do, IMHO, is making sure to join() all of them before destroying anything >> > (say, per-channel MultiFDSendParams). Then when we destroy everything >> > safely, either mutex/sem/etc.. Because no one will race us anymore. >> >> This doesn't address the race. There's a data dependency between the >> multifd channels and the migration thread around the channels_ready >> semaphore. So we cannot join the migration thread because it could be >> stuck waiting for the semaphore, which means we cannot join+cleanup the >> channel thread because the semaphore is still being used. > > I think this is the major part of confusion, on why this can happen. > > The problem is afaik multifd_save_cleanup() is only called by > migrate_fd_cleanup(), which is further only called in: > > 1) migrate_fd_cleanup_bh() > 2) migrate_fd_connect() > > For 1): it's only run when migration comletes/fails/etc. (in all cases, > right before it quits..) and then kicks off migrate_fd_cleanup_schedule(). > So migration thread shouldn't be stuck, afaiu, or it won't be able to kick > that BH. > > For 2): it's called by the main thread, where migration thread should have > not yet been created. > > With that, I don't see how migrate_fd_cleanup() would need to worry about > migration thread > > Did I miss something? There are two points: 1) multifd_new_send_channel_async() doesn't set an Error. Even if multifd_channel_connect() fails, we'll still continue with migrate_fd_connect(). I don't see any code that looks at the migration error (s->error). 2) the TLS handshake thread of one of the channels could simply not get any chance to run until something else fails and we reach multifd_save_cleanup() from the BH path. 
This second point in particular is why I don't think simply joining the TLS thread will avoid the race. There's nothing linking the multifd channels together, we could have 7 of them operational and a 8th one still going through the TLS handshake. That said, I'm not sure about the exact path we take to reach the bug situation. It's very hard to reproduce so I'm relying entirely on code inspection.
On Mon, Nov 13, 2023 at 10:50:39PM -0300, Fabiano Rosas wrote: > Peter Xu <peterx@redhat.com> writes: > > > On Fri, Nov 10, 2023 at 09:05:41AM -0300, Fabiano Rosas wrote: > > > > [...] > > > >> > Then assuming we have a clear model with all these threads issue fixed (no > >> > matter whether we'd shrink 2N threads into N threads), then what we need to > >> > do, IMHO, is making sure to join() all of them before destroying anything > >> > (say, per-channel MultiFDSendParams). Then when we destroy everything > >> > safely, either mutex/sem/etc.. Because no one will race us anymore. > >> > >> This doesn't address the race. There's a data dependency between the > >> multifd channels and the migration thread around the channels_ready > >> semaphore. So we cannot join the migration thread because it could be > >> stuck waiting for the semaphore, which means we cannot join+cleanup the > >> channel thread because the semaphore is still being used. > > > > I think this is the major part of confusion, on why this can happen. > > > > The problem is afaik multifd_save_cleanup() is only called by > > migrate_fd_cleanup(), which is further only called in: > > > > 1) migrate_fd_cleanup_bh() > > 2) migrate_fd_connect() > > > > For 1): it's only run when migration comletes/fails/etc. (in all cases, > > right before it quits..) and then kicks off migrate_fd_cleanup_schedule(). > > So migration thread shouldn't be stuck, afaiu, or it won't be able to kick > > that BH. > > > > For 2): it's called by the main thread, where migration thread should have > > not yet been created. > > > > With that, I don't see how migrate_fd_cleanup() would need to worry about > > migration thread > > > > Did I miss something? > > There are two points: > > 1) multifd_new_send_channel_async() doesn't set an Error. Even if > multifd_channel_connect() fails, we'll still continue with > migrate_fd_connect(). I don't see any code that looks at the migration > error (s->error). 
> > 2) the TLS handshake thread of one of the channels could simply not get > any chance to run until something else fails and we reach > multifd_save_cleanup() from the BH path. > > This second point in particular is why I don't think simply joining the > TLS thread will avoid the race. There's nothing linking the multifd > channels together, we could have 7 of them operational and a 8th one > still going through the TLS handshake. > > That said, I'm not sure about the exact path we take to reach the bug > situation. It's very hard to reproduce so I'm relying entirely on code > inspection. This whole patch, iiuc, was trying to move sem post, which will only kick the migration thread at different places. IMHO, the problem is if multifd_save_cleanup() is only called in either migrate_fd_connect() or the BH as discussed above, then it means migration thread is already gone! I don't see how moving that sem_post() would help in any form. It means whatever thread stuck as you said can still happen. You're right that I think the thread is just fully out of control of migration. Namely, anything we created with socket_send_channel_create(). Fundamentally, I think it's because qio_task_run_in_thread() doesn't support that control, as creating its thread as DETACHED already, leaving the rest to luck. Can we provide our own threads for that, at least keep the threadID around, then at cleanup paths we can shutdown() the IOChannels and join(), assuming that the shutdown() will kick the thread out of any blocked IO?
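[Editor's note: the shutdown-then-join pattern Peter proposes at the end — keep the thread ID, shut down the channel's underlying socket, then join — can be demonstrated with a plain socketpair. Not QEMU code; `IoThread` and the helper names are invented. `shutdown()` forces a `read()` blocked on the socket to return 0, so the join is guaranteed to complete even though no data was ever written.]

```c
#include <pthread.h>
#include <sys/socket.h>
#include <unistd.h>

typedef struct {
    pthread_t thread;
    int fd;
    long nread;
} IoThread;

static void *io_thread(void *opaque)
{
    IoThread *t = opaque;
    char buf[64];
    ssize_t r;
    /* Blocks until data arrives or the socket is shut down. */
    while ((r = read(t->fd, buf, sizeof(buf))) > 0) {
        t->nread += r;
    }
    return NULL;
}

/* Returns the number of bytes the thread read (0 here), or -1 on error. */
int shutdown_and_join(void)
{
    int fds[2];
    IoThread t = { .nread = 0 };

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, fds) < 0) {
        return -1;
    }
    t.fd = fds[0];
    pthread_create(&t.thread, NULL, io_thread, &t);

    /* shutdown() kicks the blocked read() out with EOF, so the join
     * below cannot hang -- this works whether the thread has already
     * entered read() or not. */
    shutdown(fds[0], SHUT_RDWR);
    pthread_join(t.thread, NULL);

    close(fds[0]);
    close(fds[1]);
    return (int)t.nread;
}
```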
Peter Xu <peterx@redhat.com> wrote: > On Thu, Nov 09, 2023 at 01:58:56PM -0300, Fabiano Rosas wrote: >> We cannot operate on the multifd semaphores outside of the multifd >> channel thread >> because multifd_save_cleanup() can run in parallel and >> attempt to destroy the mutexes, which causes an assert. >> >> Looking at the places where we use the semaphores aside from the >> migration thread, there's only the TLS handshake thread and the >> initial multifd_channel_connect() in the main thread. These are places >> where creating the multifd channel cannot fail, so releasing the >> semaphores at these places only serves the purpose of avoiding a >> deadlock when an error happens before the channel(s) have started, but >> after the migration thread has already called >> multifd_send_sync_main(). >> >> Instead of attempting to release the semaphores at those places, move >> the release into multifd_save_cleanup(). This puts the semaphore usage >> in the same thread that does the cleanup, eliminating the race. >> >> Move the call to multifd_save_cleanup() before joining for the >> migration thread so we release the semaphores before. 
>> >> Signed-off-by: Fabiano Rosas <farosas@suse.de> >> --- >> migration/migration.c | 4 +++- >> migration/multifd.c | 29 +++++++++++------------------ >> 2 files changed, 14 insertions(+), 19 deletions(-) >> >> diff --git a/migration/migration.c b/migration/migration.c >> index cca32c553c..52be20561b 100644 >> --- a/migration/migration.c >> +++ b/migration/migration.c >> @@ -1300,6 +1300,9 @@ static void migrate_fd_cleanup(MigrationState *s) >> QEMUFile *tmp; >> >> trace_migrate_fd_cleanup(); >> + >> + multifd_save_cleanup(); >> + >> qemu_mutex_unlock_iothread(); >> if (s->migration_thread_running) { >> qemu_thread_join(&s->thread); >> @@ -1307,7 +1310,6 @@ static void migrate_fd_cleanup(MigrationState *s) >> } >> qemu_mutex_lock_iothread(); >> >> - multifd_save_cleanup(); >> qemu_mutex_lock(&s->qemu_file_lock); >> tmp = s->to_dst_file; >> s->to_dst_file = NULL; >> diff --git a/migration/multifd.c b/migration/multifd.c >> index ec58c58082..9ca482cfbe 100644 >> --- a/migration/multifd.c >> +++ b/migration/multifd.c >> @@ -527,6 +527,9 @@ void multifd_save_cleanup(void) >> >> if (p->running) { >> qemu_thread_join(&p->thread); >> + } else { >> + qemu_sem_post(&p->sem_sync); >> + qemu_sem_post(&multifd_send_state->channels_ready); > > I think relying on p->running to join the thread is already problematic. Why? > Now all threads are created with JOINABLE, so we must join them to release > the thread resources. Clearing "running" at the end of the thread is > already wrong to me, because it means if the thread quits before the main > thread reaching here, we will not join the thread, and the thread resource > will be leaked. The bug is that the thread quits from some other place. It is no different from forgetting to do a mutex_unlock(). It is an error that needs fixing. > Here IMHO we should set p->running=true right before thread created, (that is basically what is done) The meaning of ->running is that the multifd thread has started running.
Setting it to false means that it is not running normally anymore.

thread function:

> and never clear it.

static void *multifd_send_thread(void *opaque)
{
    // No error can happen here

    while (true) {
        // No return command here, just breaks
    }

out:
    // This can fail

    qemu_mutex_lock(&p->mutex);
    p->running = false;
    qemu_mutex_unlock(&p->mutex);

    /* this can't fail */

    return NULL;
}

What running here means is that we don't need to "stop" this thread anymore. That happens as soon as we get out of the loop.

> We may even want to rename it to p->thread_created?

We can rename it, but I am not sure it buys us much. Notice that we also mean that we have created the channel. > >> } >> } >> for (i = 0; i < migrate_multifd_channels(); i++) { >> @@ -751,8 +754,6 @@ out: >> assert(local_err); >> trace_multifd_send_error(p->id); >> multifd_send_terminate_threads(local_err); >> - qemu_sem_post(&p->sem_sync); >> - qemu_sem_post(&multifd_send_state->channels_ready); >> error_free(local_err); >> } >> >> @@ -780,20 +781,15 @@ static void multifd_tls_outgoing_handshake(QIOTask *task, >> >> if (!qio_task_propagate_error(task, &err)) { >> trace_multifd_tls_outgoing_handshake_complete(ioc); >> - if (multifd_channel_connect(p, ioc, &err)) { >> - return; >> - } >> + multifd_channel_connect(p, ioc, NULL); > > Ignoring Error** is not good.. > > I think you meant to say "it should never fail", then we should put > &error_abort. Another cleaner way to do this is split the current > multifd_channel_connect() into tls and non-tls helpers, then we can call > the non-tls helpers here (which may not need an Error**). This code is really weird because it is (very) asynchronous. And TLS is even worse, because we need to do the equivalent of two listen(). Sniff. >> + } else { >> + /* >> + * The multifd client could already be waiting to queue data, >> + * so let it know that we didn't even start. 
>> + */ >> + p->quit = true; >> + trace_multifd_tls_outgoing_handshake_error(ioc, error_get_pretty(err)); >> } >> - >> - trace_multifd_tls_outgoing_handshake_error(ioc, error_get_pretty(err)); >> - >> - /* >> - * Error happen, mark multifd_send_thread status as 'quit' although it >> - * is not created, and then tell who pay attention to me. >> - */ >> - p->quit = true; >> - qemu_sem_post(&multifd_send_state->channels_ready); >> - qemu_sem_post(&p->sem_sync); >> } >> >> static void *multifd_tls_handshake_thread(void *opaque) >> @@ -862,9 +858,6 @@ static void multifd_new_send_channel_cleanup(MultiFDSendParams *p, >> QIOChannel *ioc, Error *err) >> { >> migrate_set_error(migrate_get_current(), err); >> - /* Error happen, we need to tell who pay attention to me */ >> - qemu_sem_post(&multifd_send_state->channels_ready); >> - qemu_sem_post(&p->sem_sync); >> /* >> * Although multifd_send_thread is not created, but main migration >> * thread need to judge whether it is running, so we need to mark >> -- > > I may still need some more time to digest your whole solution, currently > not very clear to me. It may or may not be the patch's problem, though. > > But let me also share how I see this.. I think we should rework the > multifd thread model on channel establishment. > > Firstly, as I mentioned above, we must always join() the threads if it's > JOINABLE or the resource is leaked, afaict. That's the first thing to fix. The problem that running is trying to fix is this: * we get an additional error while we are exiting due to any reason. So we end up calling multifd_fd_cleanup() several times, and things get really weird. So what we really need is to be sure that: - we join each thread once and only once https://linux.die.net/man/3/pthread_join Joining with a thread that has previously been joined results in undefined behavior. 
> Then let's see when TLS is there and what we do: we'll create "two" threads > just for that, what's even worse: > > - We'll create tls handshake thread in multifd_tls_channel_connect() > first, setting &p->thread. > > - Within the same thread, we do multifd_tls_outgoing_handshake() when > handshake done -> multifd_channel_connect() -> we yet create the real > multifd_send_thread(), setting &p->thread too. > > So AFAICT, the tls handshake thread is already leaked, got p->thread > overwritten by the new thread created by itself.. > > I think we should fix this then. I haven't figured the best way to do, > two things I can come up with now: > > 1) At least make the tls handshake thread detached. > > 2) Make the handshake done in multifd send thread directly; I don't yet > see why we must create two threads.. The way we listen() on sockets asynchronously. QIO channels are a mystery to me, and this was really weird. Using glib_main_loop() has lots of advantages. Understanding how to listen asynchronously is not one of them. > Then assuming we have a clear model with all these threads issue fixed (no > matter whether we'd shrink 2N threads into N threads), then what we need to > do, IMHO, is making sure to join() all of them before destroying anything > (say, per-channel MultiFDSendParams). Then when we destroy everything > safely, either mutex/sem/etc.. Because no one will race us anymore. > Would it make sense? I can't see why we need to join the threads created for TLS. Those could be made non-joinable and we could just call it a day. Later, Juan.
Fabiano Rosas <farosas@suse.de> wrote: > Peter Xu <peterx@redhat.com> writes: > >> On Thu, Nov 09, 2023 at 01:58:56PM -0300, Fabiano Rosas wrote: >>> We cannot operate on the multifd semaphores outside of the multifd >>> channel thread >>> because multifd_save_cleanup() can run in parallel and >>> attempt to destroy the mutexes, which causes an assert. >>> >>> Looking at the places where we use the semaphores aside from the >>> migration thread, there's only the TLS handshake thread and the >>> initial multifd_channel_connect() in the main thread. These are places >>> where creating the multifd channel cannot fail, so releasing the >>> semaphores at these places only serves the purpose of avoiding a >>> deadlock when an error happens before the channel(s) have started, but >>> after the migration thread has already called >>> multifd_send_sync_main(). >>> >>> Instead of attempting to release the semaphores at those places, move >>> the release into multifd_save_cleanup(). This puts the semaphore usage >>> in the same thread that does the cleanup, eliminating the race. >>> >>> Move the call to multifd_save_cleanup() before joining for the >>> migration thread so we release the semaphores before. 
>>> >>> Signed-off-by: Fabiano Rosas <farosas@suse.de> >>> --- >>> migration/migration.c | 4 +++- >>> migration/multifd.c | 29 +++++++++++------------------ >>> 2 files changed, 14 insertions(+), 19 deletions(-) >>> >>> diff --git a/migration/migration.c b/migration/migration.c >>> index cca32c553c..52be20561b 100644 >>> --- a/migration/migration.c >>> +++ b/migration/migration.c >>> @@ -1300,6 +1300,9 @@ static void migrate_fd_cleanup(MigrationState *s) >>> QEMUFile *tmp; >>> >>> trace_migrate_fd_cleanup(); >>> + >>> + multifd_save_cleanup(); >>> + >>> qemu_mutex_unlock_iothread(); >>> if (s->migration_thread_running) { >>> qemu_thread_join(&s->thread); >>> @@ -1307,7 +1310,6 @@ static void migrate_fd_cleanup(MigrationState *s) >>> } >>> qemu_mutex_lock_iothread(); >>> >>> - multifd_save_cleanup(); >>> qemu_mutex_lock(&s->qemu_file_lock); >>> tmp = s->to_dst_file; >>> s->to_dst_file = NULL; >>> diff --git a/migration/multifd.c b/migration/multifd.c >>> index ec58c58082..9ca482cfbe 100644 >>> --- a/migration/multifd.c >>> +++ b/migration/multifd.c >>> @@ -527,6 +527,9 @@ void multifd_save_cleanup(void) >>> >>> if (p->running) { >>> qemu_thread_join(&p->thread); >>> + } else { >>> + qemu_sem_post(&p->sem_sync); >>> + qemu_sem_post(&multifd_send_state->channels_ready); >> >> I think relying on p->running to join the thread is already problematic. >> > > Absolutely. I've already complained about this in another thread. I haven't seen that yet. But if there is a bug, of course that we need to fix it. >> Now all threads are created with JOINABLE, so we must join them to release >> the thread resources. Clearing "running" at the end of the thread is >> already wrong to me, because it means if the thread quits before the main >> thread reaching here, we will not join the thread, and the thread resource >> will be leaked. >> >> Here IMHO we should set p->running=true right before thread created, and >> never clear it. We may even want to rename it to p->thread_created? 
>> > > Ok, let me do that. I already have some patches touching > multifd_new_send_channel_async() due to fixed-ram. I still think this is a bad idea. What that value means is whether we have to send a post/whatever or not. What I could agree with is that when we put ->running to false, we could also shut down the QIO channel. >>> } >>> } >>> for (i = 0; i < migrate_multifd_channels(); i++) { >>> @@ -751,8 +754,6 @@ out: >>> assert(local_err); >>> trace_multifd_send_error(p->id); >>> multifd_send_terminate_threads(local_err); >>> - qemu_sem_post(&p->sem_sync); >>> - qemu_sem_post(&multifd_send_state->channels_ready); >>> error_free(local_err); >>> } >>> >>> @@ -780,20 +781,15 @@ static void multifd_tls_outgoing_handshake(QIOTask *task, >>> >>> if (!qio_task_propagate_error(task, &err)) { >>> trace_multifd_tls_outgoing_handshake_complete(ioc); >>> - if (multifd_channel_connect(p, ioc, &err)) { >>> - return; >>> - } >>> + multifd_channel_connect(p, ioc, NULL); >> >> Ignoring Error** is not good.. >> >> I think you meant to say "it should never fail", then we should put >> &error_abort. Another cleaner way to do this is split the current >> multifd_channel_connect() into tls and non-tls helpers, then we can call >> the non-tls helpers here (which may not need an Error**). > > I attempted the split and it looked awkward, so I let go. But I agree > about the Error. As said, that code is really weird because tcp+tls is basically two listen(). I never understood that part of the code, I think it was Daniel who wrote it. >> I may still need some more time to digest your whole solution, currently >> not very clear to me. It may or may not be the patch's problem, >> though. > > The core assumption of this patch is that we currently only need to > release the semaphores because the multifd_send_sync_main() is allowed > to execute even before the multifd channels have started. 
If creation > failed to even start, for instance because of a TLS handshake failure, > the migration thread will hang waiting for channels_ready (or sem_sync). > > Then there are two premises: > > - when an error happens, multifd_save_cleanup() is always called; That should happen. The problem is that nothing guarantees that we only have one error. Or that, while we are sending, something tests migration_has_error() and decides to stop on the error path too. This needs to be audited. I agree that we can do better here, but it is not trivial. The full change is 9.0 material. > - a hanging migration thread (at multifd_send_sync_main) will not stop > the main thread from reaching multifd_save_cleanup. I don't remember, to be fair. But I agree that we shouldn't wait. > If this holds, then it's safe to release the semaphores right before > destroying them at multifd_save_cleanup. >> But let me also share how I see this.. I think we should rework the >> multifd thread model on channel establishment. >> >> Firstly, as I mentioned above, we must always join() the threads if it's >> JOINABLE or the resource is leaked, afaict. That's the first thing to fix. > > I think we historically stumbled upon the fact that qemu_thread_join() > is not the same as pthread_join(). The former takes a pointer and is not > safe to call with a NULL QemuThread. That seems to be the reason for the > p->running check before it. As I pointed out on the other thread, joining a thread twice is undefined behaviour, so we need to be careful here. We return NULL, so it is not clear to me that we need to do anything else there. We could return an Error, but the problem here is that we want to tell the user the first error, not any other error. >> Then let's see when TLS is there and what we do: we'll create "two" threads >> just for that, what's even worse: >> >> - We'll create tls handshake thread in multifd_tls_channel_connect() >> first, setting &p->thread. 
>> >> - Within the same thread, we do multifd_tls_outgoing_handshake() when >> handshake done -> multifd_channel_connect() -> we yet create the real >> multifd_send_thread(), setting &p->thread too. > > Hmm good point, we're calling qio_task_complete() from within the > thread, that's misleading. I believe using qio_task_run_in_thread() > would put the completion callback in the main thread, right? That would > look a bit better I think because we could then join the handshake > thread before multifd_channel_connect(). >> So AFAICT, the tls handshake thread is already leaked, got p->thread >> overwritten by the new thread created by itself.. >> >> I think we should fix this then. I haven't figured the best way to do, >> two things I can come up with now: >> >> 1) At least make the tls handshake thread detached. >> >> 2) Make the handshake done in multifd send thread directly; I don't yet >> see why we must create two threads.. > > I'm (a little bit) against this. It's nice to know that a multifdsend_# > thread will only be doing IO and no other task. I have the same concern > about putting zero page detection in the multifd thread. > >> Then assuming we have a clear model with all these threads issue fixed (no >> matter whether we'd shrink 2N threads into N threads), then what we need to >> do, IMHO, is making sure to join() all of them before destroying anything >> (say, per-channel MultiFDSendParams). Then when we destroy everything >> safely, either mutex/sem/etc.. Because no one will race us anymore. > > This doesn't address the race. There's a data dependency between the > multifd channels and the migration thread around the channels_ready > semaphore. So we cannot join the migration thread because it could be > stuck waiting for the semaphore, which means we cannot join+cleanup the > channel thread because the semaphore is still being used. That's why I'm > moving the semaphores post to the point where we join the > thread. 
Reading your email, now I think that should be: > > qemu_sem_post(&p->sem_sync); > qemu_sem_post(&multifd_send_state->channels_ready); > qemu_thread_join(&p->thread); > > Fundamentally, the issue is that we do create the multifd threads before > the migration thread, but they might not be ready before the migration > thread needs to use them. Welcome to the async world. But I agree that your idea could work. Later, Juan.
Fabiano Rosas <farosas@suse.de> wrote: > Fabiano Rosas <farosas@suse.de> writes: > >> Peter Xu <peterx@redhat.com> writes: >> >>> On Thu, Nov 09, 2023 at 01:58:56PM -0300, Fabiano Rosas wrote: >> I think we historically stumbled upon the fact that qemu_thread_join() >> is not the same as pthread_join(). The former takes a pointer and is not >> safe to call with a NULL QemuThread. That seems to be the reason for the >> p->running check before it. > > Scratch this part, the QemuThread is not a pointer. > > ...should it be? Because then we can test p->thread instead of > p->running, which would be more precise and would dispense the > thread_created flag. You still need to make sure that you don't join the thread twice. And we do the qemu_pthread_join() without any lock. Later, Juan.
Juan Quintela <quintela@redhat.com> writes: > Peter Xu <peterx@redhat.com> wrote: >> On Thu, Nov 09, 2023 at 01:58:56PM -0300, Fabiano Rosas wrote: >>> We cannot operate on the multifd semaphores outside of the multifd >>> channel thread >>> because multifd_save_cleanup() can run in parallel and >>> attempt to destroy the mutexes, which causes an assert. >>> >>> Looking at the places where we use the semaphores aside from the >>> migration thread, there's only the TLS handshake thread and the >>> initial multifd_channel_connect() in the main thread. These are places >>> where creating the multifd channel cannot fail, so releasing the >>> semaphores at these places only serves the purpose of avoiding a >>> deadlock when an error happens before the channel(s) have started, but >>> after the migration thread has already called >>> multifd_send_sync_main(). >>> >>> Instead of attempting to release the semaphores at those places, move >>> the release into multifd_save_cleanup(). This puts the semaphore usage >>> in the same thread that does the cleanup, eliminating the race. >>> >>> Move the call to multifd_save_cleanup() before joining for the >>> migration thread so we release the semaphores before. 
>>> >>> Signed-off-by: Fabiano Rosas <farosas@suse.de> >>> --- >>> migration/migration.c | 4 +++- >>> migration/multifd.c | 29 +++++++++++------------------ >>> 2 files changed, 14 insertions(+), 19 deletions(-) >>> >>> diff --git a/migration/migration.c b/migration/migration.c >>> index cca32c553c..52be20561b 100644 >>> --- a/migration/migration.c >>> +++ b/migration/migration.c >>> @@ -1300,6 +1300,9 @@ static void migrate_fd_cleanup(MigrationState *s) >>> QEMUFile *tmp; >>> >>> trace_migrate_fd_cleanup(); >>> + >>> + multifd_save_cleanup(); >>> + >>> qemu_mutex_unlock_iothread(); >>> if (s->migration_thread_running) { >>> qemu_thread_join(&s->thread); >>> @@ -1307,7 +1310,6 @@ static void migrate_fd_cleanup(MigrationState *s) >>> } >>> qemu_mutex_lock_iothread(); >>> >>> - multifd_save_cleanup(); >>> qemu_mutex_lock(&s->qemu_file_lock); >>> tmp = s->to_dst_file; >>> s->to_dst_file = NULL; >>> diff --git a/migration/multifd.c b/migration/multifd.c >>> index ec58c58082..9ca482cfbe 100644 >>> --- a/migration/multifd.c >>> +++ b/migration/multifd.c >>> @@ -527,6 +527,9 @@ void multifd_save_cleanup(void) >>> >>> if (p->running) { >>> qemu_thread_join(&p->thread); >>> + } else { >>> + qemu_sem_post(&p->sem_sync); >>> + qemu_sem_post(&multifd_send_state->channels_ready); >> >> I think relying on p->running to join the thread is already problematic. > > Why? The channel holds resources that need to be freed by multifd_save_cleanup(). We cannot reliably do so while the channel itself is the one responsible for telling the main thread whether it is done with those resources. p->running works fine for knowing "has the thread been created", so we can use it to avoid joining if it was never created. But it is bad for knowing "is the thread still running" because as soon as the channel sets p->running=false, multifd_save_cleanup() could attempt to destroy the p->mutex and the semaphores. 
The bug reported in the other series was about this already: [PATCH] migrate/multifd: fix coredump when the multifd thread cleanup https://lore.kernel.org/r/20230621081826.3203053-1-zhangjianguo18@huawei.com >> Now all threads are created with JOINABLE, so we must join them to release >> the thread resources. Clearing "running" at the end of the thread is >> already wrong to me, because it means if the thread quits before the main >> thread reaching here, we will not join the thread, and the thread resource >> will be leaked. > > The bug is that the thread quits from other place. > It is not different that forgetting to do a mutex_unlock(). It is an > error that needs fixing. > >> Here IMHO we should set p->running=true right before thread created, > > (that is basically what is done) > > Meaning of ->running is that multifd thread has started running. it a > false is that it is not running normally anymore. The issue is that multifd_save_cleanup looks at p->running = false and continues with the cleanup that destroys the mutex while the TLS thread might be using it. > > thread function: > >> and never clear it. > > static void *multifd_send_thread(void *opaque) > { > // No error can happens here > > while (true) { > // No return command here, just breaks > } > > out: > // This can fail > > qemu_mutex_lock(&p->mutex); > p->running = false; > qemu_mutex_unlock(&p->mutex); > > /* this can't fail */ > > return NULL; > } > > > What running here means is that we don't need to "stop" this thread > anymore. That happens as soon as we get out of the loop. > >> We may even want to rename it to p->thread_created? > > We can rename it, but I am not sure if it buys it too much. Notice that > we also mean that we have created the channel. 
> >> >>> } >>> } >>> for (i = 0; i < migrate_multifd_channels(); i++) { >>> @@ -751,8 +754,6 @@ out: >>> assert(local_err); >>> trace_multifd_send_error(p->id); >>> multifd_send_terminate_threads(local_err); >>> - qemu_sem_post(&p->sem_sync); >>> - qemu_sem_post(&multifd_send_state->channels_ready); >>> error_free(local_err); >>> } >>> >>> @@ -780,20 +781,15 @@ static void multifd_tls_outgoing_handshake(QIOTask *task, >>> >>> if (!qio_task_propagate_error(task, &err)) { >>> trace_multifd_tls_outgoing_handshake_complete(ioc); >>> - if (multifd_channel_connect(p, ioc, &err)) { >>> - return; >>> - } >>> + multifd_channel_connect(p, ioc, NULL); >> >> Ignoring Error** is not good.. >> >> I think you meant to say "it should never fail", then we should put >> &error_abort. Another cleaner way to do this is split the current >> multifd_channel_connect() into tls and non-tls helpers, then we can call >> the non-tls helpers here (which may not need an Error**). > > This code is really weird because it is (very) asynchronous. > And TLS is even worse, because we need to do the equilent of two > listen(). > Sniff. > >>> + } else { >>> + /* >>> + * The multifd client could already be waiting to queue data, >>> + * so let it know that we didn't even start. >>> + */ >>> + p->quit = true; >>> + trace_multifd_tls_outgoing_handshake_error(ioc, error_get_pretty(err)); >>> } >>> - >>> - trace_multifd_tls_outgoing_handshake_error(ioc, error_get_pretty(err)); >>> - >>> - /* >>> - * Error happen, mark multifd_send_thread status as 'quit' although it >>> - * is not created, and then tell who pay attention to me. 
>>> - */ >>> - p->quit = true; >>> - qemu_sem_post(&multifd_send_state->channels_ready); >>> - qemu_sem_post(&p->sem_sync); >>> } >>> >>> static void *multifd_tls_handshake_thread(void *opaque) >>> @@ -862,9 +858,6 @@ static void multifd_new_send_channel_cleanup(MultiFDSendParams *p, >>> QIOChannel *ioc, Error *err) >>> { >>> migrate_set_error(migrate_get_current(), err); >>> - /* Error happen, we need to tell who pay attention to me */ >>> - qemu_sem_post(&multifd_send_state->channels_ready); >>> - qemu_sem_post(&p->sem_sync); >>> /* >>> * Although multifd_send_thread is not created, but main migration >>> * thread need to judge whether it is running, so we need to mark >>> -- >> >> I may still need some more time to digest your whole solution, currently >> not very clear to me. It may or may not be the patch's problem, though. >> >> But let me also share how I see this.. I think we should rework the >> multifd thread model on channel establishment. >> >> Firstly, as I mentioned above, we must always join() the threads if it's >> JOINABLE or the resource is leaked, afaict. That's the first thing to fix. > > The problem that running is trying to fix is this: > * we get an additional error while we are exiting due to any reason. > > So we end calling multifd_fd_cleanup() several times, and things get > really weird. > > So what we really need is to be sure that: > - we join each thread once and only once > > https://linux.die.net/man/3/pthread_join > > Joining with a thread that has previously been joined results in undefined behavior. > I don't see how we would call multifd_save_cleanup() more than once. The qemu_mutex_destroy() calls in the loop below would already assert if called twice. >> Then let's see when TLS is there and what we do: we'll create "two" threads >> just for that, what's even worse: >> >> - We'll create tls handshake thread in multifd_tls_channel_connect() >> first, setting &p->thread. 
>> >> - Within the same thread, we do multifd_tls_outgoing_handshake() when >> handshake done -> multifd_channel_connect() -> we yet create the real >> multifd_send_thread(), setting &p->thread too. >> >> So AFAICT, the tls handshake thread is already leaked, got p->thread >> overwritten by the new thread created by itself.. >> >> I think we should fix this then. I haven't figured the best way to do, >> two things I can come up with now: >> >> 1) At least make the tls handshake thread detached. >> >> 2) Make the handshake done in multifd send thread directly; I don't yet >> see why we must create two threads.. > > The way we listen() in sockets anynchronously. QIO channels are a > mystery to me, and this was really weird. > > Using glib_main_loop() has lots of advantages. Understanding how to > listen asynchronously is not one of them. > >> Then assuming we have a clear model with all these threads issue fixed (no >> matter whether we'd shrink 2N threads into N threads), then what we need to >> do, IMHO, is making sure to join() all of them before destroying anything >> (say, per-channel MultiFDSendParams). Then when we destroy everything >> safely, either mutex/sem/etc.. Because no one will race us anymore. > >> Would it make sense? > > I can't see why we need to join the threads created for tls. That ones > could be not joinable and just call it a day. From the perspective of this series, we just need to be sure multifd_tls_outgoing_handshake() never runs in parallel with multifd_save_cleanup(). Joining the thread would be one way to do that.
diff --git a/migration/migration.c b/migration/migration.c
index cca32c553c..52be20561b 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -1300,6 +1300,9 @@ static void migrate_fd_cleanup(MigrationState *s)
     QEMUFile *tmp;
 
     trace_migrate_fd_cleanup();
+
+    multifd_save_cleanup();
+
     qemu_mutex_unlock_iothread();
     if (s->migration_thread_running) {
         qemu_thread_join(&s->thread);
@@ -1307,7 +1310,6 @@ static void migrate_fd_cleanup(MigrationState *s)
     }
     qemu_mutex_lock_iothread();
 
-    multifd_save_cleanup();
     qemu_mutex_lock(&s->qemu_file_lock);
     tmp = s->to_dst_file;
     s->to_dst_file = NULL;
diff --git a/migration/multifd.c b/migration/multifd.c
index ec58c58082..9ca482cfbe 100644
--- a/migration/multifd.c
+++ b/migration/multifd.c
@@ -527,6 +527,9 @@ void multifd_save_cleanup(void)
 
         if (p->running) {
             qemu_thread_join(&p->thread);
+        } else {
+            qemu_sem_post(&p->sem_sync);
+            qemu_sem_post(&multifd_send_state->channels_ready);
         }
     }
     for (i = 0; i < migrate_multifd_channels(); i++) {
@@ -751,8 +754,6 @@ out:
         assert(local_err);
         trace_multifd_send_error(p->id);
         multifd_send_terminate_threads(local_err);
-        qemu_sem_post(&p->sem_sync);
-        qemu_sem_post(&multifd_send_state->channels_ready);
         error_free(local_err);
     }
 
@@ -780,20 +781,15 @@ static void multifd_tls_outgoing_handshake(QIOTask *task,
 
     if (!qio_task_propagate_error(task, &err)) {
         trace_multifd_tls_outgoing_handshake_complete(ioc);
-        if (multifd_channel_connect(p, ioc, &err)) {
-            return;
-        }
+        multifd_channel_connect(p, ioc, NULL);
+    } else {
+        /*
+         * The multifd client could already be waiting to queue data,
+         * so let it know that we didn't even start.
+         */
+        p->quit = true;
+        trace_multifd_tls_outgoing_handshake_error(ioc, error_get_pretty(err));
     }
-
-    trace_multifd_tls_outgoing_handshake_error(ioc, error_get_pretty(err));
-
-    /*
-     * Error happen, mark multifd_send_thread status as 'quit' although it
-     * is not created, and then tell who pay attention to me.
-     */
-    p->quit = true;
-    qemu_sem_post(&multifd_send_state->channels_ready);
-    qemu_sem_post(&p->sem_sync);
 }
 
 static void *multifd_tls_handshake_thread(void *opaque)
@@ -862,9 +858,6 @@ static void multifd_new_send_channel_cleanup(MultiFDSendParams *p,
                                              QIOChannel *ioc, Error *err)
 {
     migrate_set_error(migrate_get_current(), err);
-    /* Error happen, we need to tell who pay attention to me */
-    qemu_sem_post(&multifd_send_state->channels_ready);
-    qemu_sem_post(&p->sem_sync);
     /*
      * Although multifd_send_thread is not created, but main migration
      * thread need to judge whether it is running, so we need to mark
We cannot operate on the multifd semaphores outside of the multifd
channel thread because multifd_save_cleanup() can run in parallel and
attempt to destroy the mutexes, which causes an assert.

Looking at the places where we use the semaphores aside from the
migration thread, there's only the TLS handshake thread and the
initial multifd_channel_connect() in the main thread. These are places
where creating the multifd channel cannot fail, so releasing the
semaphores at these places only serves the purpose of avoiding a
deadlock when an error happens before the channel(s) have started, but
after the migration thread has already called
multifd_send_sync_main().

Instead of attempting to release the semaphores at those places, move
the release into multifd_save_cleanup(). This puts the semaphore usage
in the same thread that does the cleanup, eliminating the race.

Move the call to multifd_save_cleanup() before joining for the
migration thread so we release the semaphores before.

Signed-off-by: Fabiano Rosas <farosas@suse.de>
---
 migration/migration.c |  4 +++-
 migration/multifd.c   | 29 +++++++++++------------------
 2 files changed, 14 insertions(+), 19 deletions(-)