
[v1] coroutine: avoid inserting duplicate coroutine to co_queue_wakeup

Message ID 20240325091850.1087235-1-zhuyangyang14@huawei.com (mailing list archive)
State New, archived
Series [v1] coroutine: avoid inserting duplicate coroutine to co_queue_wakeup

Commit Message

Zhu Yangyang March 25, 2024, 9:18 a.m. UTC
If g_main_loop_run()/aio_poll() is called in coroutine context, the
pending coroutine may be woken up repeatedly, and the co_queue_wakeup
list may be corrupted.

When the poll() syscall in g_main_loop_run()/aio_poll() returns, it
means some of the monitored events have completed, so the corresponding
completion callbacks are dispatched.

If such a callback needs to invoke aio_co_enter(), it will only queue
the coroutine for wakeup (because we are already in coroutine context),
which means the data on the monitored event_fd/socket_fd may not be
read/cleared yet. When the next poll() returns, the coroutine is woken
up and inserted into the wakeup queue again.

For example, if TLS is enabled in NBD, the server calls g_main_loop_run()
inside a coroutine, and the io_read event on the socket repeatedly wakes
up the same coroutine. The call stack is as follows:

aio_co_enter()
aio_co_wake()
qio_channel_restart_read()
aio_dispatch_handler()
aio_dispatch_handlers()
aio_dispatch()
aio_ctx_dispatch()
g_main_context_dispatch()
g_main_loop_run()
nbd_negotiate_handle_starttls()
nbd_negotiate_options()
nbd_negotiate()
nbd_co_client_start()
coroutine_trampoline()

Signed-off-by: zhuyangyang <zhuyangyang14@huawei.com>
---
 util/async.c | 13 ++++++++++++-
 1 file changed, 12 insertions(+), 1 deletion(-)

Comments

Stefan Hajnoczi March 25, 2024, 3:50 p.m. UTC | #1
On Mon, Mar 25, 2024 at 05:18:50PM +0800, zhuyangyang wrote:
> If g_main_loop_run()/aio_poll() is called in the coroutine context,
> the pending coroutine may be woken up repeatedly, and the co_queue_wakeup
> may be disordered.

aio_poll() must not be called from coroutine context:

  bool no_coroutine_fn aio_poll(AioContext *ctx, bool blocking);
       ^^^^^^^^^^^^^^^

Coroutines are not supposed to block. Instead, they should yield.

> When the poll() syscall exited in g_main_loop_run()/aio_poll(), it means
> some listened events is completed. Therefore, the completion callback
> function is dispatched.
> 
> If this callback function needs to invoke aio_co_enter(), it will only
> wake up the coroutine (because we are already in coroutine context),
> which may cause that the data on this listening event_fd/socket_fd
> is not read/cleared. When the next poll () exits, it will be woken up again
> and inserted into the wakeup queue again.
> 
> For example, if TLS is enabled in NBD, the server will call g_main_loop_run()
> in the coroutine, and repeatedly wake up the io_read event on a socket.
> The call stack is as follows:
> 
> aio_co_enter()
> aio_co_wake()
> qio_channel_restart_read()
> aio_dispatch_handler()
> aio_dispatch_handlers()
> aio_dispatch()
> aio_ctx_dispatch()
> g_main_context_dispatch()
> g_main_loop_run()
> nbd_negotiate_handle_starttls()

This code does not look like it was designed to run in coroutine
context. Two options:

1. Don't run it in coroutine context (e.g. use a BH instead). This
   avoids blocking the coroutine but calling g_main_loop_run() is still
   ugly, in my opinion.

2. Get rid of data.loop and use coroutine APIs instead:

   while (!data.complete) {
       qemu_coroutine_yield();
   }

   and update nbd_tls_handshake() to call aio_co_wake(data->co) instead
   of g_main_loop_quit(data->loop).

   This requires auditing the code to check whether the event loop might
   invoke something that interferes with
   nbd_negotiate_handle_starttls(). Typically this means monitor
   commands or fd activity that could change the state of this
   connection while it is yielded. This is where the real work is but
   hopefully it will not be that hard to figure out.
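
   To make option 2 concrete, here is a minimal sketch (not the actual
   patch; it assumes NBDTLSHandshakeData gets a "Coroutine *co" field in
   place of data->loop):

    /* Hedged sketch: data->co is an assumed new field. */
    void nbd_tls_handshake(QIOTask *task, void *opaque)
    {
        struct NBDTLSHandshakeData *data = opaque;

        qio_task_propagate_error(task, &data->error);
        data->complete = true;
        if (!qemu_coroutine_entered(data->co)) {
            /* instead of g_main_loop_quit(data->loop) */
            aio_co_wake(data->co);
        }
    }

    /* ...and in nbd_negotiate_handle_starttls(): */
    data.co = qemu_coroutine_self();
    qio_channel_tls_handshake(tioc, nbd_tls_handshake, &data, NULL, NULL);
    while (!data.complete) {
        qemu_coroutine_yield();
    }

   The qemu_coroutine_entered() check guards against the handshake
   completing synchronously, i.e. the callback running before the
   coroutine has had a chance to yield.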

> nbd_negotiate_options()
> nbd_negotiate()
> nbd_co_client_start()
> coroutine_trampoline()
> 
> Signed-off-by: zhuyangyang <zhuyangyang14@huawei.com>
> ---
>  util/async.c | 13 ++++++++++++-
>  1 file changed, 12 insertions(+), 1 deletion(-)
> 
> diff --git a/util/async.c b/util/async.c
> index 0467890052..25fc1e6083 100644
> --- a/util/async.c
> +++ b/util/async.c
> @@ -705,7 +705,18 @@ void aio_co_enter(AioContext *ctx, Coroutine *co)
>      if (qemu_in_coroutine()) {
>          Coroutine *self = qemu_coroutine_self();
>          assert(self != co);
> -        QSIMPLEQ_INSERT_TAIL(&self->co_queue_wakeup, co, co_queue_next);
> +        /*
> +         * If the Coroutine *co is already in the co_queue_wakeup, this
> +         * repeated insertion will causes the loss of other queue element
> +         * or infinite loop.
> +         * For examplex:
> +         * Head->a->b->c->NULL, after insert_tail(head, b) => Head->a->b->NULL
> +         * Head->a-b>->NULL, after insert_tail(head, b) => Head->a->b->b...
> +         */
> +        if (!co->co_queue_next.sqe_next &&
> +            self->co_queue_wakeup.sqh_last != &co->co_queue_next.sqe_next) {
> +            QSIMPLEQ_INSERT_TAIL(&self->co_queue_wakeup, co, co_queue_next);
> +        }
>      } else {
>          qemu_aio_coroutine_enter(ctx, co);
>      }
> -- 
> 2.33.0
>
Eric Blake March 25, 2024, 4 p.m. UTC | #2
I've seen (and agree with) Stefan's reply that a more thorough audit
is needed, but am still providing a preliminary review based on what I
see here.

On Mon, Mar 25, 2024 at 05:18:50PM +0800, zhuyangyang via wrote:
> If g_main_loop_run()/aio_poll() is called in the coroutine context,
> the pending coroutine may be woken up repeatedly, and the co_queue_wakeup
> may be disordered.

coroutine context should not be doing anything that can block; you may
have uncovered a bigger bug that we need to address.

> 
> When the poll() syscall exited in g_main_loop_run()/aio_poll(), it means
> some listened events is completed. Therefore, the completion callback
> function is dispatched.
> 
> If this callback function needs to invoke aio_co_enter(), it will only
> wake up the coroutine (because we are already in coroutine context),
> which may cause that the data on this listening event_fd/socket_fd
> is not read/cleared. When the next poll () exits, it will be woken up again
> and inserted into the wakeup queue again.
> 
> For example, if TLS is enabled in NBD, the server will call g_main_loop_run()
> in the coroutine, and repeatedly wake up the io_read event on a socket.
> The call stack is as follows:
> 
> aio_co_enter()
> aio_co_wake()
> qio_channel_restart_read()
> aio_dispatch_handler()
> aio_dispatch_handlers()
> aio_dispatch()
> aio_ctx_dispatch()
> g_main_context_dispatch()
> g_main_loop_run()
> nbd_negotiate_handle_starttls()
> nbd_negotiate_options()
> nbd_negotiate()
> nbd_co_client_start()
> coroutine_trampoline()
> 
> Signed-off-by: zhuyangyang <zhuyangyang14@huawei.com>
> ---
>  util/async.c | 13 ++++++++++++-
>  1 file changed, 12 insertions(+), 1 deletion(-)
> 
> diff --git a/util/async.c b/util/async.c
> index 0467890052..25fc1e6083 100644
> --- a/util/async.c
> +++ b/util/async.c
> @@ -705,7 +705,18 @@ void aio_co_enter(AioContext *ctx, Coroutine *co)
>      if (qemu_in_coroutine()) {
>          Coroutine *self = qemu_coroutine_self();
>          assert(self != co);
> -        QSIMPLEQ_INSERT_TAIL(&self->co_queue_wakeup, co, co_queue_next);
> +        /*
> +         * If the Coroutine *co is already in the co_queue_wakeup, this
> +         * repeated insertion will causes the loss of other queue element

s/causes/cause/

> +         * or infinite loop.
> +         * For examplex:

s/examplex/example/

> +         * Head->a->b->c->NULL, after insert_tail(head, b) => Head->a->b->NULL
> +         * Head->a-b>->NULL, after insert_tail(head, b) => Head->a->b->b...

s/b>->/b->/

> +         */
> +        if (!co->co_queue_next.sqe_next &&
> +            self->co_queue_wakeup.sqh_last != &co->co_queue_next.sqe_next) {
> +            QSIMPLEQ_INSERT_TAIL(&self->co_queue_wakeup, co, co_queue_next);
> +        }
>      } else {
>          qemu_aio_coroutine_enter(ctx, co);
>      }

Intuitively, attacking the symptoms (avoiding bogus list insertion
when it is already on the list) makes sense; but I want to make sure
we attack the root cause.
Zhu Yangyang March 26, 2024, 11:53 a.m. UTC | #3
On Mon, 25 Mar 2024 11:50:41 -0400, Stefan Hajnoczi wrote:
> On Mon, Mar 25, 2024 at 05:18:50PM +0800, zhuyangyang wrote:
> > If g_main_loop_run()/aio_poll() is called in the coroutine context,
> > the pending coroutine may be woken up repeatedly, and the co_queue_wakeup
> > may be disordered.
> 
> aio_poll() must not be called from coroutine context:
> 
>   bool no_coroutine_fn aio_poll(AioContext *ctx, bool blocking);
>        ^^^^^^^^^^^^^^^
> 
> Coroutines are not supposed to block. Instead, they should yield.
> 
> > When the poll() syscall exited in g_main_loop_run()/aio_poll(), it means
> > some listened events is completed. Therefore, the completion callback
> > function is dispatched.
> > 
> > If this callback function needs to invoke aio_co_enter(), it will only
> > wake up the coroutine (because we are already in coroutine context),
> > which may cause that the data on this listening event_fd/socket_fd
> > is not read/cleared. When the next poll () exits, it will be woken up again
> > and inserted into the wakeup queue again.
> > 
> > For example, if TLS is enabled in NBD, the server will call g_main_loop_run()
> > in the coroutine, and repeatedly wake up the io_read event on a socket.
> > The call stack is as follows:
> > 
> > aio_co_enter()
> > aio_co_wake()
> > qio_channel_restart_read()
> > aio_dispatch_handler()
> > aio_dispatch_handlers()
> > aio_dispatch()
> > aio_ctx_dispatch()
> > g_main_context_dispatch()
> > g_main_loop_run()
> > nbd_negotiate_handle_starttls()
> 
> This code does not look like it was designed to run in coroutine
> context. Two options:
> 
> 1. Don't run it in coroutine context (e.g. use a BH instead). This
>    avoids blocking the coroutine but calling g_main_loop_run() is still
>    ugly, in my opinion.
> 
> 2. Get rid of data.loop and use coroutine APIs instead:
> 
>    while (!data.complete) {
>        qemu_coroutine_yield();
>    }
> 
>    and update nbd_tls_handshake() to call aio_co_wake(data->co) instead
>    of g_main_loop_quit(data->loop).
> 
>    This requires auditing the code to check whether the event loop might
>    invoke something that interferes with
>    nbd_negotiate_handle_starttls(). Typically this means monitor
>    commands or fd activity that could change the state of this
>    connection while it is yielded. This is where the real work is but
>    hopefully it will not be that hard to figure out.

Thank you for your help; I'll try to fix it using the coroutine API.

> 
> > nbd_negotiate_options()
> > nbd_negotiate()
> > nbd_co_client_start()
> > coroutine_trampoline()
> >
Zhu Yangyang March 26, 2024, 12:25 p.m. UTC | #4
On Mon, 25 Mar 2024 11:00:31 -0500 Eric Blake wrote:
> >  util/async.c | 13 ++++++++++++-
> >  1 file changed, 12 insertions(+), 1 deletion(-)
> > 
> > diff --git a/util/async.c b/util/async.c
> > index 0467890052..25fc1e6083 100644
> > --- a/util/async.c
> > +++ b/util/async.c
> > @@ -705,7 +705,18 @@ void aio_co_enter(AioContext *ctx, Coroutine *co)
> >      if (qemu_in_coroutine()) {
> >          Coroutine *self = qemu_coroutine_self();
> >          assert(self != co);
> > -        QSIMPLEQ_INSERT_TAIL(&self->co_queue_wakeup, co, co_queue_next);
> > +        /*
> > +         * If the Coroutine *co is already in the co_queue_wakeup, this
> > +         * repeated insertion will causes the loss of other queue element
> 
> s/causes/cause/
> 
> > +         * or infinite loop.
> > +         * For examplex:
> 
> s/examplex/example/
> 
> > +         * Head->a->b->c->NULL, after insert_tail(head, b) => Head->a->b->NULL
> > +         * Head->a-b>->NULL, after insert_tail(head, b) => Head->a->b->b...
> 
> s/b>->/b->/
> 
> > +         */
> > +        if (!co->co_queue_next.sqe_next &&
> > +            self->co_queue_wakeup.sqh_last != &co->co_queue_next.sqe_next) {
> > +            QSIMPLEQ_INSERT_TAIL(&self->co_queue_wakeup, co, co_queue_next);
> > +        }
> >      } else {
> >          qemu_aio_coroutine_enter(ctx, co);
> >      }
> 
> Intuitively, attacking the symptoms (avoiding bogus list insertion
> when it is already on the list) makes sense; but I want to make sure
> we attack the root cause.

Fixing nbd_negotiate_handle_starttls() can solve this problem, so I'm
not sure whether this commit is still needed.

--
Best Regards,
Zhu Yangyang
Eric Blake March 27, 2024, 10:13 p.m. UTC | #5
On Mon, Mar 25, 2024 at 11:50:41AM -0400, Stefan Hajnoczi wrote:
> On Mon, Mar 25, 2024 at 05:18:50PM +0800, zhuyangyang wrote:
> > If g_main_loop_run()/aio_poll() is called in the coroutine context,
> > the pending coroutine may be woken up repeatedly, and the co_queue_wakeup
> > may be disordered.
> 
> aio_poll() must not be called from coroutine context:
> 
>   bool no_coroutine_fn aio_poll(AioContext *ctx, bool blocking);
>        ^^^^^^^^^^^^^^^
> 
> Coroutines are not supposed to block. Instead, they should yield.
> 
> > When the poll() syscall exited in g_main_loop_run()/aio_poll(), it means
> > some listened events is completed. Therefore, the completion callback
> > function is dispatched.
> > 
> > If this callback function needs to invoke aio_co_enter(), it will only
> > wake up the coroutine (because we are already in coroutine context),
> > which may cause that the data on this listening event_fd/socket_fd
> > is not read/cleared. When the next poll () exits, it will be woken up again
> > and inserted into the wakeup queue again.
> > 
> > For example, if TLS is enabled in NBD, the server will call g_main_loop_run()
> > in the coroutine, and repeatedly wake up the io_read event on a socket.
> > The call stack is as follows:
> > 
> > aio_co_enter()
> > aio_co_wake()
> > qio_channel_restart_read()
> > aio_dispatch_handler()
> > aio_dispatch_handlers()
> > aio_dispatch()
> > aio_ctx_dispatch()
> > g_main_context_dispatch()
> > g_main_loop_run()
> > nbd_negotiate_handle_starttls()
> 
> This code does not look like it was designed to run in coroutine
> context. Two options:
> 
> 1. Don't run it in coroutine context (e.g. use a BH instead). This
>    avoids blocking the coroutine but calling g_main_loop_run() is still
>    ugly, in my opinion.
> 
> 2. Get rid of data.loop and use coroutine APIs instead:
> 
>    while (!data.complete) {
>        qemu_coroutine_yield();
>    }
> 
>    and update nbd_tls_handshake() to call aio_co_wake(data->co) instead
>    of g_main_loop_quit(data->loop).
> 
>    This requires auditing the code to check whether the event loop might
>    invoke something that interferes with
>    nbd_negotiate_handle_starttls(). Typically this means monitor
>    commands or fd activity that could change the state of this
>    connection while it is yielded. This is where the real work is but
>    hopefully it will not be that hard to figure out.

I agree that 1) is ugly.  So far, I've added some temporary
assertions, to see when qio_channel_tls_handshake is reached; it looks
like nbd/server.c is calling it from within coroutine context, but
nbd/client.c is calling it from the main loop.  The qio code handles
either, but now I'm stuck in trying to get client.c into having the
right coroutine context; the TLS handshake is done before the usual
BlockDriverState *bs object is available, so I'm not even sure what
aio context, if any, I should be using.  But on my first try, I'm
hitting:

qemu-img: ../util/async.c:707: aio_co_enter: Assertion `self != co' failed.

so I obviously got something wrong.

It may be possible to use block_gen_c to turn nbd_receive_negotiate
and nbd_receive_export_list into co_wrapper functions, if I can audit
that all of their code goes through qio (and is therefore
coroutine-safe), but that work is still ongoing.

At any rate, qemu-iotest 233 should be good for testing that changes
in this area work; I've also been testing with libnbd (test
interop/interop-qemu-nbd-tls-certs hits qemu's server.c) and nbdkit
(test tests/test-tls-psk.sh hits qemu's client.c).
Kevin Wolf March 28, 2024, 8:54 a.m. UTC | #6
Am 27.03.2024 um 23:13 hat Eric Blake geschrieben:
> On Mon, Mar 25, 2024 at 11:50:41AM -0400, Stefan Hajnoczi wrote:
> > On Mon, Mar 25, 2024 at 05:18:50PM +0800, zhuyangyang wrote:
> > > If g_main_loop_run()/aio_poll() is called in the coroutine context,
> > > the pending coroutine may be woken up repeatedly, and the co_queue_wakeup
> > > may be disordered.
> > 
> > aio_poll() must not be called from coroutine context:
> > 
> >   bool no_coroutine_fn aio_poll(AioContext *ctx, bool blocking);
> >        ^^^^^^^^^^^^^^^
> > 
> > Coroutines are not supposed to block. Instead, they should yield.
> > 
> > > When the poll() syscall exited in g_main_loop_run()/aio_poll(), it means
> > > some listened events is completed. Therefore, the completion callback
> > > function is dispatched.
> > > 
> > > If this callback function needs to invoke aio_co_enter(), it will only
> > > wake up the coroutine (because we are already in coroutine context),
> > > which may cause that the data on this listening event_fd/socket_fd
> > > is not read/cleared. When the next poll () exits, it will be woken up again
> > > and inserted into the wakeup queue again.
> > > 
> > > For example, if TLS is enabled in NBD, the server will call g_main_loop_run()
> > > in the coroutine, and repeatedly wake up the io_read event on a socket.
> > > The call stack is as follows:
> > > 
> > > aio_co_enter()
> > > aio_co_wake()
> > > qio_channel_restart_read()
> > > aio_dispatch_handler()
> > > aio_dispatch_handlers()
> > > aio_dispatch()
> > > aio_ctx_dispatch()
> > > g_main_context_dispatch()
> > > g_main_loop_run()
> > > nbd_negotiate_handle_starttls()
> > 
> > This code does not look like it was designed to run in coroutine
> > context. Two options:
> > 
> > 1. Don't run it in coroutine context (e.g. use a BH instead). This
> >    avoids blocking the coroutine but calling g_main_loop_run() is still
> >    ugly, in my opinion.
> > 
> > 2. Get rid of data.loop and use coroutine APIs instead:
> > 
> >    while (!data.complete) {
> >        qemu_coroutine_yield();
> >    }
> > 
> >    and update nbd_tls_handshake() to call aio_co_wake(data->co) instead
> >    of g_main_loop_quit(data->loop).
> > 
> >    This requires auditing the code to check whether the event loop might
> >    invoke something that interferes with
> >    nbd_negotiate_handle_starttls(). Typically this means monitor
> >    commands or fd activity that could change the state of this
> >    connection while it is yielded. This is where the real work is but
> >    hopefully it will not be that hard to figure out.
> 
> I agree that 1) is ugly.  So far, I've added some temporary
> assertions, to see when qio_channel_tls_handshake is reached; it looks
> like nbd/server.c is calling it from within coroutine context, but
> nbd/client.c is calling it from the main loop.  The qio code handles
> either, but now I'm stuck in trying to get client.c into having the
> right coroutine context; the TLS handshake is done before the usual
> BlockDriverState *bs object is available, so I'm not even sure what
> aio context, if any, I should be using.  But on my first try, I'm
> hitting:
> 
> qemu-img: ../util/async.c:707: aio_co_enter: Assertion `self != co' failed.
> 
> so I obviously got something wrong.

Hard to tell without seeing the code, but it looks like you're trying to
wake up the coroutine while you're still executing in the context of the
same coroutine.

It looks like the documentation of qio_channel_tls_handshake() is wrong
and the function can return and call the callback immediately without
dropping out of coroutine context.

A rather heavy-handed, but obvious approach would be moving the
qio_channel_tls_handshake() into a BH, then you know you'll always be
outside of coroutine context in the callback.
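
A minimal sketch of that BH variant (assumed names; data->tioc and the
helper below do not exist in the current code):

    static void nbd_tls_handshake_bh(void *opaque)
    {
        struct NBDTLSHandshakeData *data = opaque;

        /* Runs from the event loop, outside coroutine context, so a
         * synchronous handshake callback cannot re-enter the coroutine. */
        qio_channel_tls_handshake(data->tioc, nbd_tls_handshake, data,
                                  NULL, NULL);
    }

    /* In the coroutine, instead of calling qio_channel_tls_handshake()
     * directly: */
    aio_bh_schedule_oneshot(qemu_get_current_aio_context(),
                            nbd_tls_handshake_bh, &data);
    while (!data.complete) {
        qemu_coroutine_yield();
    }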

But maybe it would be enough to just check if the coroutine isn't
already active:

    if (!qemu_coroutine_entered(co)) {
        aio_co_wake(co);
    }

> It may be possible to use block_gen_c to turn nbd_receive_negotiate
> and nbd_receive_export_list into co_wrapper functions, if I can audit
> that all of their code goes through qio (and is therefore
> coroutine-safe), but that work is still ongoing.

If it's possible, I think that would be nicer in the code and would also
reduce the time the guest is blocked while we're creating a new NBD
connection.

*reads code*

Hm... Actually, one thing I was completely unaware of is that all of
this is running in a separate thread, so maybe the existing synchronous
code already doesn't block the guest. nbd_co_establish_connection()
starts this thread. The thread doesn't have an AioContext, so anything
that involves scheduling something in an AioContext (including BHs and
waking up coroutines) will result in code running in a different thread.

I'm not sure why a thread is used in the first place (does it do
something that coroutines can't do?) and if running parts of the code in
a different thread would be a problem, but we should probably have a
look at this first. Mixing threads and coroutines feels strange.

Kevin
Eric Blake March 28, 2024, 12:40 p.m. UTC | #7
On Mon, Mar 25, 2024 at 05:18:50PM +0800, zhuyangyang via wrote:
> If g_main_loop_run()/aio_poll() is called in the coroutine context,
> the pending coroutine may be woken up repeatedly, and the co_queue_wakeup
> may be disordered.
> 
> When the poll() syscall exited in g_main_loop_run()/aio_poll(), it means
> some listened events is completed. Therefore, the completion callback
> function is dispatched.
> 
> If this callback function needs to invoke aio_co_enter(), it will only
> wake up the coroutine (because we are already in coroutine context),
> which may cause that the data on this listening event_fd/socket_fd
> is not read/cleared. When the next poll () exits, it will be woken up again
> and inserted into the wakeup queue again.
> 
> For example, if TLS is enabled in NBD, the server will call g_main_loop_run()
> in the coroutine, and repeatedly wake up the io_read event on a socket.
> The call stack is as follows:
> 
> aio_co_enter()
> aio_co_wake()
> qio_channel_restart_read()
> aio_dispatch_handler()
> aio_dispatch_handlers()
> aio_dispatch()
> aio_ctx_dispatch()
> g_main_context_dispatch()
> g_main_loop_run()
> nbd_negotiate_handle_starttls()
> nbd_negotiate_options()
> nbd_negotiate()
> nbd_co_client_start()
> coroutine_trampoline()

zhuyangyang, do you have a reliable reproduction setup for how you
were able to trigger this?  Obviously, it only happens when TLS is
enabled (we aren't creating a g_main_loop_run for any other NBD
command), and only when the server is first starting to serve a
client; is this a case where you were hammering a long-running qemu
process running an NBD server with multiple clients trying to
reconnect to the server all near the same time?

If we can come up with a reliable formula for reproducing the
corrupted coroutine list, it would make a great iotest addition
alongside the existing qemu-iotests 233 for ensuring that NBD TLS
traffic is handled correctly in both server and client.

> 
> Signed-off-by: zhuyangyang <zhuyangyang14@huawei.com>

Side note: this appears to be your first qemu contribution (based on
'git shortlog --author zhuyangyang').  While I am not in a position to
presume how you would like your name Anglicized, I will point out that
the prevailing style is to separate given name from family name (just
because your username at work has no spaces does not mean that your
S-o-b has to follow suit).  It is also permissible to list your name
in native characters alongside or in place of the Anglicized version;
for example, 'git log --author="Stefano Dong"' shows this technique.
Zhu Yangyang March 29, 2024, 1:09 p.m. UTC | #8
On Thu, 28 Mar 2024 07:40:14 -0500, Eric Blake wrote:
> On Mon, Mar 25, 2024 at 05:18:50PM +0800, zhuyangyang via wrote:
> > If g_main_loop_run()/aio_poll() is called in the coroutine context,
> > the pending coroutine may be woken up repeatedly, and the co_queue_wakeup
> > may be disordered.
> > 
> > When the poll() syscall exited in g_main_loop_run()/aio_poll(), it means
> > some listened events is completed. Therefore, the completion callback
> > function is dispatched.
> > 
> > If this callback function needs to invoke aio_co_enter(), it will only
> > wake up the coroutine (because we are already in coroutine context),
> > which may cause that the data on this listening event_fd/socket_fd
> > is not read/cleared. When the next poll () exits, it will be woken up again
> > and inserted into the wakeup queue again.
> > 
> > For example, if TLS is enabled in NBD, the server will call g_main_loop_run()
> > in the coroutine, and repeatedly wake up the io_read event on a socket.
> > The call stack is as follows:
> > 
> > aio_co_enter()
> > aio_co_wake()
> > qio_channel_restart_read()
> > aio_dispatch_handler()
> > aio_dispatch_handlers()
> > aio_dispatch()
> > aio_ctx_dispatch()
> > g_main_context_dispatch()
> > g_main_loop_run()
> > nbd_negotiate_handle_starttls()
> > nbd_negotiate_options()
> > nbd_negotiate()
> > nbd_co_client_start()
> > coroutine_trampoline()
> 
> zhuyangyang, do you have a reliable reproduction setup for how you
> were able to trigger this?  Obviously, it only happens when TLS is
> enabled (we aren't creating a g_main_loop_run for any other NBD
> command), and only when the server is first starting to serve a
> client; is this a case where you were hammering a long-running qemu
> process running an NBD server with multiple clients trying to
> reconnect to the server all near the same time?

I'm sorry I didn't make the background of the problem clear before.
This problem is not seen with the NBD commands themselves, but during
VM live migration with QEMU TLS.

Next, I'll detail how to reproduce the issue.

1. Make the problem more obvious.

When TLS is enabled during live migration, the migration may stall
because some I/O requests never complete during the mirror job on the
target host.

Now we know that the reason is that some coroutines are lost.
The entry function of these lost coroutines is nbd_trip().

Add an assertion on the target host side to make the problem
show up quickly.

$ git diff util/async.c
diff --git a/util/async.c b/util/async.c
index 0467890052..4e3547c3ea 100644
--- a/util/async.c
+++ b/util/async.c
@@ -705,6 +705,7 @@ void aio_co_enter(AioContext *ctx, Coroutine *co)
     if (qemu_in_coroutine()) {
         Coroutine *self = qemu_coroutine_self();
         assert(self != co);
+        assert(!co->co_queue_next.sqe_next);
         QSIMPLEQ_INSERT_TAIL(&self->co_queue_wakeup, co, co_queue_next);
     } else {
         qemu_aio_coroutine_enter(ctx, co);

2. Reproduce the issue

1) Start the VM on the source host.

Note: configure multiple disks for the VM
(more than 6 disks are recommended).

These disk tasks (nbd_trip) will then run concurrently with
nbd_negotiate_handle_starttls() on the main thread during migration.

<domain type='kvm' id='1'>
  <name>centos7.3_64_server</name>
  <memory unit='KiB'>4194304</memory>
  <currentMemory unit='KiB'>4194304</currentMemory>
  <vcpu placement='static'>4</vcpu>
  <resource>
    <partition>/machine</partition>
  </resource>
  <os>
    <type arch='x86_64' machine='pc-i440fx-9.0'>hvm</type>
    <boot dev='hd'/>
  </os>
  <features>
    <acpi/>
    <apic eoi='on'/>
    <pae/>
  </features>
  <cpu mode='host-passthrough' check='none' migratable='on'/>
  <clock offset='utc'>
    <timer name='hpet' present='no'/>
    <timer name='rtc' tickpolicy='catchup' track='guest'/>
    <timer name='pit' tickpolicy='delay'/>
  </clock>
  <on_poweroff>destroy</on_poweroff>
  <on_reboot>restart</on_reboot>
  <on_crash>restart</on_crash>
  <devices>
    <emulator>/usr/bin/qemu-kvm</emulator>
    <disk type='file' device='disk'>
      <driver name='qemu' type='raw' cache='none' io='native'/>
      <source file='/Images/TestImg/centos7.3_64_server' index='6'/>
      <backingStore/>
      <target dev='vda' bus='virtio'/>
      <alias name='virtio-disk0'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x04' function='0x0'/>
    </disk>
    <disk type='file' device='disk'>
      <driver name='qemu' type='raw' cache='none' io='native'/>
      <source file='/Images/TestImg/kvm-disk-001' index='5'/>
      <backingStore/>
      <target dev='vdb' bus='virtio'/>
      <alias name='virtio-disk1'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x0'/>
    </disk>
    <disk type='file' device='disk'>
      <driver name='qemu' type='raw' cache='none' io='native'/>
      <source file='/Images/TestImg/kvm-disk-002' index='4'/>
      <backingStore/>
      <target dev='vdc' bus='virtio'/>
      <alias name='virtio-disk2'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x06' function='0x0'/>
    </disk>
    <disk type='file' device='disk'>
      <driver name='qemu' type='raw' cache='none' io='native'/>
      <source file='/Images/TestImg/kvm-disk-003' index='3'/>
      <backingStore/>
      <target dev='vdd' bus='virtio'/>
      <alias name='virtio-disk3'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x07' function='0x0'/>
    </disk>
    <disk type='file' device='disk'>
      <driver name='qemu' type='raw' cache='none' io='native'/>
      <source file='/Images/TestImg/kvm-disk-004' index='2'/>
      <backingStore/>
      <target dev='vde' bus='virtio'/>
      <alias name='virtio-disk4'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x08' function='0x0'/>
    </disk>
    <disk type='file' device='disk'>
      <driver name='qemu' type='raw' cache='none' io='native'/>
      <source file='/Images/TestImg/kvm-disk-005' index='1'/>
      <backingStore/>
      <target dev='vdf' bus='virtio'/>
      <alias name='virtio-disk5'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x09' function='0x0'/>
    </disk>
    <controller type='pci' index='0' model='pci-root'>
      <alias name='pci.0'/>
    </controller>
  </devices>
</domain>

$ virsh create vm_x86.xml
Domain 'centos7.3_64_server' created from /home/vm_x86.xml

2) Migrate the VM to the target host.
virsh migrate --live --p2p \
   --migrateuri tcp:10.91.xxx.xxx centos7.3_64_server qemu+tcp://10.91.xxx.xxx/system \
   --copy-storage-all \
   --tls

Then an error is reported on the peer host:
qemu-kvm: ../util/async.c:705: aio_co_enter: Assertion `!co->co_queue_next.sqe_next' failed.

> 
> If we can come up with a reliable formula for reproducing the
> corrupted coroutine list, it would make a great iotest addition
> alongside the existing qemu-iotests 233 for ensuring that NBD TLS
> traffic is handled correctly in both server and client.

I'm not sure whether this can be used for testing qemu-nbd.

> 
> > 
> > Signed-off-by: zhuyangyang <zhuyangyang14@huawei.com>
> 
> Side note: this appears to be your first qemu contribution (based on
> 'git shortlog --author zhuyangyang').  While I am not in a position to
> presume how you would like your name Anglicized, I will point out that
> the prevailing style is to separate given name from family name (just
> because your username at work has no spaces does not mean that your
> S-o-b has to follow suit).  It is also permissible to list your name
> in native characters alongside or in place of the Anglicized version;
> for example, 'git log --author="Stefano Dong"' shows this technique.

Yes, I will update my name in the next submission. Thank you very much for your help.
Zhu Yangyang April 1, 2024, 8:04 a.m. UTC | #9
On Thu, 28 Mar 2024 07:40:14 -0500, Eric Blake wrote:
> On Mon, Mar 25, 2024 at 05:18:50PM +0800, zhuyangyang via wrote:
> > If g_main_loop_run()/aio_poll() is called in the coroutine context,
> > the pending coroutine may be woken up repeatedly, and the co_queue_wakeup
> > may be disordered.
> > 
> > When the poll() syscall exited in g_main_loop_run()/aio_poll(), it means
> > some listened events is completed. Therefore, the completion callback
> > function is dispatched.
> > 
> > If this callback function needs to invoke aio_co_enter(), it will only
> > wake up the coroutine (because we are already in coroutine context),
> > which may cause that the data on this listening event_fd/socket_fd
> > is not read/cleared. When the next poll () exits, it will be woken up again
> > and inserted into the wakeup queue again.
> > 
> > For example, if TLS is enabled in NBD, the server will call g_main_loop_run()
> > in the coroutine, and repeatedly wake up the io_read event on a socket.
> > The call stack is as follows:
> > 
> > aio_co_enter()
> > aio_co_wake()
> > qio_channel_restart_read()
> > aio_dispatch_handler()
> > aio_dispatch_handlers()
> > aio_dispatch()
> > aio_ctx_dispatch()
> > g_main_context_dispatch()
> > g_main_loop_run()
> > nbd_negotiate_handle_starttls()
> > nbd_negotiate_options()
> > nbd_negotiate()
> > nbd_co_client_start()
> > coroutine_trampoline()
> 
> zhuyangyang, do you have a reliable reproduction setup for how you
> were able to trigger this?  Obviously, it only happens when TLS is
> enabled (we aren't creating a g_main_loop_run for any other NBD
> command), and only when the server is first starting to serve a
> client; is this a case where you were hammering a long-running qemu
> process running an NBD server with multiple clients trying to
> reconnect to the server all near the same time?

This problem can no longer be reproduced after commit
7c1f51bf38 ("nbd/server: Fix drained_poll to wake coroutine in right AioContext"),
which avoids repeatedly waking up the same coroutine.

Invoking g_main_loop_run() in the coroutine still causes the event
completion callback qio_channel_restart_read() to be called repeatedly,
but the coroutine is now woken up only once.

The key modifications are as follows:

static void qio_channel_restart_read(void *opaque)
{
    QIOChannel *ioc = opaque;
-   Coroutine *co = ioc->read_coroutine;
+   Coroutine *co = qatomic_xchg(&ioc->read_coroutine, NULL);
+
+   if (!co) {
+       return;
+   }

    /* Assert that aio_co_wake() reenters the coroutine directly */
    assert(qemu_get_current_aio_context() ==
           qemu_coroutine_get_aio_context(co));
    aio_co_wake(co);
}

The root cause is that poll() is invoked in coroutine context.

> 
> If we can come up with a reliable formula for reproducing the
> corrupted coroutine list, it would make a great iotest addition
> alongside the existing qemu-iotests 233 for ensuring that NBD TLS
> traffic is handled correctly in both server and client.
> 
> > 
> > Signed-off-by: zhuyangyang <zhuyangyang14@huawei.com>
> 
> Side note: this appears to be your first qemu contribution (based on
> 'git shortlog --author zhuyangyang').  While I am not in a position to
> presume how you would like your name Anglicized, I will point out that
> the prevailing style is to separate given name from family name (just
> because your username at work has no spaces does not mean that your
> S-o-b has to follow suit).  It is also permissible to list your name
> in native characters alongside or in place of the Anglicized version;
> for example, 'git log --author="Stefano Dong"' shows this technique.

Patch

diff --git a/util/async.c b/util/async.c
index 0467890052..25fc1e6083 100644
--- a/util/async.c
+++ b/util/async.c
@@ -705,7 +705,18 @@  void aio_co_enter(AioContext *ctx, Coroutine *co)
     if (qemu_in_coroutine()) {
         Coroutine *self = qemu_coroutine_self();
         assert(self != co);
-        QSIMPLEQ_INSERT_TAIL(&self->co_queue_wakeup, co, co_queue_next);
+        /*
+         * If the Coroutine *co is already in the co_queue_wakeup, this
+         * repeated insertion will causes the loss of other queue element
+         * or infinite loop.
+         * For examplex:
+         * Head->a->b->c->NULL, after insert_tail(head, b) => Head->a->b->NULL
+         * Head->a-b>->NULL, after insert_tail(head, b) => Head->a->b->b...
+         */
+        if (!co->co_queue_next.sqe_next &&
+            self->co_queue_wakeup.sqh_last != &co->co_queue_next.sqe_next) {
+            QSIMPLEQ_INSERT_TAIL(&self->co_queue_wakeup, co, co_queue_next);
+        }
     } else {
         qemu_aio_coroutine_enter(ctx, co);
     }
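
For readers unfamiliar with QEMU's QSIMPLEQ macros, the standalone
sketch below (plain C, not QEMU's queue.h) reproduces the two failure
modes described in the patch comment and shows why the guard also has
to compare sqh_last: the current tail element already has a NULL
sqe_next, so that field alone cannot prove the coroutine is off the
queue.

/* Minimal illustration of duplicate tail insertion into a singly-linked
 * tail queue; mirrors the behaviour of QSIMPLEQ_INSERT_TAIL. */
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

struct node { struct node *next; const char *name; };
struct queue { struct node *first; struct node **last; };

static void insert_tail(struct queue *q, struct node *n)
{
    n->next = NULL;     /* blindly assumes n is not already queued */
    *q->last = n;
    q->last = &n->next;
}

int main(void)
{
    struct node a = { NULL, "a" }, b = { NULL, "b" }, c = { NULL, "c" };
    struct queue q = { NULL, &q.first };

    insert_tail(&q, &a);
    insert_tail(&q, &b);
    insert_tail(&q, &c);            /* Head->a->b->c->NULL */

    insert_tail(&q, &b);            /* duplicate insert of b */
    for (struct node *n = q.first; n; n = n->next) {
        printf("%s ", n->name);     /* prints "a b": c has been lost */
    }
    printf("\n");

    /* b is now the tail, so b.next is NULL even though b is queued;
     * only q.last == &b.next reveals that.  Inserting b once more at
     * this point would set b.next = &b, giving the a->b->b->... loop. */
    assert(b.next == NULL && q.last == &b.next);
    return 0;
}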