
[v2,for-2.11,2/4] coroutine: abort if we try to schedule or enter a pending coroutine

Message ID 1f0fd95c2096688add2c7b3cfcd7016756ef19fb.1511230683.git.jcody@redhat.com (mailing list archive)
State New, archived

Commit Message

Jeff Cody Nov. 21, 2017, 2:23 a.m. UTC
The previous patch fixed a race condition in which coroutines were
being executed twice, or after coroutine deletion.

We can detect common scenarios in which this happens, print an error
message, and abort before we corrupt memory / data or segfault.

This patch will abort if an attempt is made to enter a coroutine while
it is currently pending execution, either in a specific AioContext bh
or pending execution via a timer.  It will also abort if a coroutine
is scheduled before a prior scheduled run has occurred.

We cannot rely on the existing co->caller check for recursive re-entry
to catch this, as the coroutine may run and exit with
COROUTINE_TERMINATE before the scheduled coroutine executes.

(This is the scenario that was occurring and fixed in the previous
patch).

Signed-off-by: Jeff Cody <jcody@redhat.com>
---
 include/qemu/coroutine_int.h |  6 ++++++
 util/async.c                 | 11 +++++++++++
 util/qemu-coroutine-sleep.c  | 11 +++++++++++
 util/qemu-coroutine.c        | 11 +++++++++++
 4 files changed, 39 insertions(+)
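
As an editorial illustration only (not part of the posted series), this is
the kind of misuse the new checks are intended to catch, assuming the
semantics described in the commit message:

    aio_co_schedule(ctx, co);   /* claims co: co->scheduled set to __func__   */
    aio_co_schedule(ctx, co);   /* co still pending -> error message, abort() */

    /* Likewise, entering co directly while it is still pending would now
     * abort in qemu_aio_coroutine_enter() instead of corrupting state:       */
    qemu_coroutine_enter(co);   /* co->scheduled != NULL -> abort()           */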

Comments

Stefan Hajnoczi Nov. 21, 2017, 10:59 a.m. UTC | #1
On Mon, Nov 20, 2017 at 09:23:24PM -0500, Jeff Cody wrote:
> @@ -438,6 +439,16 @@ fail:
>  void aio_co_schedule(AioContext *ctx, Coroutine *co)
>  {
>      trace_aio_co_schedule(ctx, co);
> +    const char *scheduled = atomic_read(&co->scheduled);
> +
> +    if (scheduled) {
> +        fprintf(stderr,
> +                "%s: Co-routine was already scheduled in '%s'\n",
> +                __func__, scheduled);
> +        abort();
> +    }
> +    atomic_set(&co->scheduled, __func__);

According to docs/devel/atomics.txt atomic_set()/atomic_read() are
weakly ordered.  They require memory barriers to provide guarantees
about ordering.  Your patch doesn't include barriers or comments about
where the implicit barriers are.

The docs recommend using the following instead of
atomic_read()/atomic_set() to get ordering:

  typeof(*ptr) atomic_mb_read(ptr)
  void         atomic_mb_set(ptr, val)
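
As a sketch of how that suggestion could apply to the aio_co_schedule() hunk
above, assuming QEMU's atomic_mb_read()/atomic_mb_set() from qemu/atomic.h
(an editorial illustration, not what the posted patch does):

    void aio_co_schedule(AioContext *ctx, Coroutine *co)
    {
        trace_aio_co_schedule(ctx, co);
        /* atomic_mb_read() pairs with the atomic_mb_set() below (and with
         * the NULL store in co_schedule_bh_cb()), so the check and the claim
         * are ordered instead of relying on the relaxed accessors. */
        const char *scheduled = atomic_mb_read(&co->scheduled);

        if (scheduled) {
            fprintf(stderr,
                    "%s: Co-routine was already scheduled in '%s'\n",
                    __func__, scheduled);
            abort();
        }
        atomic_mb_set(&co->scheduled, __func__);

        QSLIST_INSERT_HEAD_ATOMIC(&ctx->scheduled_coroutines,
                                  co, co_scheduled_next);
        qemu_bh_schedule(ctx->co_schedule_bh);
    }

Note that a separate read-then-set pair is still not atomic as a whole;
Paolo's atomic_cmpxchg() suggestion below closes that window as well.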
Eric Blake Nov. 21, 2017, 12:20 p.m. UTC | #2
On 11/20/2017 08:23 PM, Jeff Cody wrote:
> The previous patch fixed a race condition, in which there were
> coroutines being executing doubly, or after coroutine deletion.
> 
> We can detect common scenarios when this happens, and print an error
> message and abort before we corrupt memory / data, or segfault.
> 
> This patch will abort if an attempt to enter a coroutine is made while
> it is currently pending execution, either in a specific AioContext bh,
> or pending execution via a timer.  It will also abort if a coroutine
> is scheduled, before a prior scheduled run has occured.

s/occured/occurred/

> 
> We cannot rely on the existing co->caller check for recursive re-entry
> to catch this, as the coroutine may run and exit with
> COROUTINE_TERMINATE before the scheduled coroutine executes.
> 
> (This is the scenario that was occuring and fixed in the previous

s/occuring/occurring/

> patch).
> 
> Signed-off-by: Jeff Cody <jcody@redhat.com>
> ---
Paolo Bonzini Nov. 21, 2017, 1:11 p.m. UTC | #3
On 21/11/2017 11:59, Stefan Hajnoczi wrote:
> On Mon, Nov 20, 2017 at 09:23:24PM -0500, Jeff Cody wrote:
>> @@ -438,6 +439,16 @@ fail:
>>  void aio_co_schedule(AioContext *ctx, Coroutine *co)
>>  {
>>      trace_aio_co_schedule(ctx, co);
>> +    const char *scheduled = atomic_read(&co->scheduled);
>> +
>> +    if (scheduled) {
>> +        fprintf(stderr,
>> +                "%s: Co-routine was already scheduled in '%s'\n",
>> +                __func__, scheduled);
>> +        abort();
>> +    }
>> +    atomic_set(&co->scheduled, __func__);
> 
> According to docs/devel/atomics.txt atomic_set()/atomic_read() are
> weakly ordered.  They require memory barriers to provide guarantees
> about ordering.  Your patch doesn't include barriers or comments about
> where the implicit barriers are.
> 
> The docs recommend using the following instead of
> atomic_read()/atomic_set() to get ordering:
> 
>   typeof(*ptr) atomic_mb_read(ptr)
>   void         atomic_mb_set(ptr, val)

Even with atomic_mb_read/atomic_mb_set we should document what memory
ordering is required for correctness (i.e. what should be ordered
against what, or equivalently what operation has release/acquire semantics).

My reasoning is that the NULL write to co->scheduled should be ordered
before a write of co (anywhere), but the same is true for co->ctx so the
NULL write is ordered by the smp_wmb in qemu_aio_coroutine_enter.  So
that is fine but it needs a comment.

Dually, the read should be ordered after a read of co, but that's
handled by the callers via atomic_rcu_read if they need the ordering.
No comment needed here I think.

What's left is the __func__ write, where atomic_mb_set should be used
indeed. Or, perhaps better:

    const char *scheduled = atomic_cmpxchg(&co->scheduled,
                                           NULL, __func__);
    if (scheduled) {
        ...
    }

... which also removes the atomic_read.

Thanks,

Paolo
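
Applied to the same hunk, Paolo's suggestion might look like this (editorial
sketch; QEMU's atomic_cmpxchg() returns the previous value and is
sequentially consistent):

    void aio_co_schedule(AioContext *ctx, Coroutine *co)
    {
        trace_aio_co_schedule(ctx, co);
        /* Atomically claim the coroutine: only the caller that swings
         * co->scheduled from NULL to its own __func__ may schedule it;
         * any other caller sees the previous owner's name and aborts. */
        const char *scheduled = atomic_cmpxchg(&co->scheduled, NULL, __func__);

        if (scheduled) {
            fprintf(stderr,
                    "%s: Co-routine was already scheduled in '%s'\n",
                    __func__, scheduled);
            abort();
        }

        QSLIST_INSERT_HEAD_ATOMIC(&ctx->scheduled_coroutines,
                                  co, co_scheduled_next);
        qemu_bh_schedule(ctx->co_schedule_bh);
    }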
Kevin Wolf Nov. 21, 2017, 1:47 p.m. UTC | #4
Am 21.11.2017 um 03:23 hat Jeff Cody geschrieben:
> The previous patch fixed a race condition, in which there were
> coroutines being executing doubly, or after coroutine deletion.
> 
> We can detect common scenarios when this happens, and print an error
> message and abort before we corrupt memory / data, or segfault.
> 
> This patch will abort if an attempt to enter a coroutine is made while
> it is currently pending execution, either in a specific AioContext bh,
> or pending execution via a timer.  It will also abort if a coroutine
> is scheduled, before a prior scheduled run has occured.
> 
> We cannot rely on the existing co->caller check for recursive re-entry
> to catch this, as the coroutine may run and exit with
> COROUTINE_TERMINATE before the scheduled coroutine executes.
> 
> (This is the scenario that was occuring and fixed in the previous
> patch).
> 
> Signed-off-by: Jeff Cody <jcody@redhat.com>
> ---
>  include/qemu/coroutine_int.h |  6 ++++++
>  util/async.c                 | 11 +++++++++++
>  util/qemu-coroutine-sleep.c  | 11 +++++++++++
>  util/qemu-coroutine.c        | 11 +++++++++++
>  4 files changed, 39 insertions(+)
> 
> diff --git a/include/qemu/coroutine_int.h b/include/qemu/coroutine_int.h
> index cb98892..56e4c48 100644
> --- a/include/qemu/coroutine_int.h
> +++ b/include/qemu/coroutine_int.h
> @@ -53,6 +53,12 @@ struct Coroutine {
>  
>      /* Only used when the coroutine has yielded.  */
>      AioContext *ctx;
> +
> +    /* Used to catch and abort on illegal co-routine entry.
> +     * Will contain the name of the function that had first
> +     * scheduled the coroutine. */
> +    const char *scheduled;

Not sure if it makes any difference in practice, but I just want to
mention that the new field is right after a cacheline boundary and
the only field that is used in qemu_aio_coroutine_enter() and accesses
this second cacheline.

I'm not paying much attention to this kind of thing in most contexts,
but entering a coroutine is a hot path that we want to be fast, so maybe
it's worth having a second look.

Kevin
Paolo Bonzini Nov. 21, 2017, 3:11 p.m. UTC | #5
On 21/11/2017 14:47, Kevin Wolf wrote:
> Am 21.11.2017 um 03:23 hat Jeff Cody geschrieben:
>> The previous patch fixed a race condition, in which there were
>> coroutines being executing doubly, or after coroutine deletion.
>>
>> We can detect common scenarios when this happens, and print an error
>> message and abort before we corrupt memory / data, or segfault.
>>
>> This patch will abort if an attempt to enter a coroutine is made while
>> it is currently pending execution, either in a specific AioContext bh,
>> or pending execution via a timer.  It will also abort if a coroutine
>> is scheduled, before a prior scheduled run has occured.
>>
>> We cannot rely on the existing co->caller check for recursive re-entry
>> to catch this, as the coroutine may run and exit with
>> COROUTINE_TERMINATE before the scheduled coroutine executes.
>>
>> (This is the scenario that was occuring and fixed in the previous
>> patch).
>>
>> Signed-off-by: Jeff Cody <jcody@redhat.com>
>> ---
>>  include/qemu/coroutine_int.h |  6 ++++++
>>  util/async.c                 | 11 +++++++++++
>>  util/qemu-coroutine-sleep.c  | 11 +++++++++++
>>  util/qemu-coroutine.c        | 11 +++++++++++
>>  4 files changed, 39 insertions(+)
>>
>> diff --git a/include/qemu/coroutine_int.h b/include/qemu/coroutine_int.h
>> index cb98892..56e4c48 100644
>> --- a/include/qemu/coroutine_int.h
>> +++ b/include/qemu/coroutine_int.h
>> @@ -53,6 +53,12 @@ struct Coroutine {
>>  
>>      /* Only used when the coroutine has yielded.  */
>>      AioContext *ctx;
>> +
>> +    /* Used to catch and abort on illegal co-routine entry.
>> +     * Will contain the name of the function that had first
>> +     * scheduled the coroutine. */
>> +    const char *scheduled;
> 
> Not sure if it makes any difference in practice, but I just want to
> mention that the new field is right after a cacheline boundary and
> the only field that is used in qemu_aio_coroutine_enter() and accesses
> this second cacheline.
> 
> I'm not paying much attention to this kind of thing in most contexts,
> but entering a coroutine is a hot path that we want to be fast, so maybe
> it's worth having a second look.

Makes sense!  Since co_queue_wakeup is used on *yield*, maybe the order
should be: ctx, scheduled, co_queue_next, co_queue_wakeup,
co_scheduled_next.

Thanks,

Paolo
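
A sketch of the tail of struct Coroutine with the ordering Paolo suggests
(earlier members omitted, and assuming co_queue_wakeup is the QSIMPLEQ head
used for wakeups; editorial illustration, not part of the series):

    struct Coroutine {
        /* ... entry, caller, pool linkage and other members omitted ... */

        /* Only used when the coroutine has yielded.  */
        AioContext *ctx;

        /* Keep the fields touched on enter/yield adjacent, so that
         * qemu_aio_coroutine_enter() (ctx, scheduled) and yield/wakeup
         * (co_queue_next, co_queue_wakeup) share a cacheline. */
        const char *scheduled;
        QSIMPLEQ_ENTRY(Coroutine) co_queue_next;
        QSIMPLEQ_HEAD(, Coroutine) co_queue_wakeup;
        QSLIST_ENTRY(Coroutine) co_scheduled_next;
    };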

Patch

diff --git a/include/qemu/coroutine_int.h b/include/qemu/coroutine_int.h
index cb98892..56e4c48 100644
--- a/include/qemu/coroutine_int.h
+++ b/include/qemu/coroutine_int.h
@@ -53,6 +53,12 @@  struct Coroutine {
 
     /* Only used when the coroutine has yielded.  */
     AioContext *ctx;
+
+    /* Used to catch and abort on illegal co-routine entry.
+     * Will contain the name of the function that had first
+     * scheduled the coroutine. */
+    const char *scheduled;
+
     QSIMPLEQ_ENTRY(Coroutine) co_queue_next;
     QSLIST_ENTRY(Coroutine) co_scheduled_next;
 };
diff --git a/util/async.c b/util/async.c
index 0e1bd87..49174b3 100644
--- a/util/async.c
+++ b/util/async.c
@@ -388,6 +388,7 @@  static void co_schedule_bh_cb(void *opaque)
         QSLIST_REMOVE_HEAD(&straight, co_scheduled_next);
         trace_aio_co_schedule_bh_cb(ctx, co);
         aio_context_acquire(ctx);
+        atomic_set(&co->scheduled, NULL);
         qemu_coroutine_enter(co);
         aio_context_release(ctx);
     }
@@ -438,6 +439,16 @@  fail:
 void aio_co_schedule(AioContext *ctx, Coroutine *co)
 {
     trace_aio_co_schedule(ctx, co);
+    const char *scheduled = atomic_read(&co->scheduled);
+
+    if (scheduled) {
+        fprintf(stderr,
+                "%s: Co-routine was already scheduled in '%s'\n",
+                __func__, scheduled);
+        abort();
+    }
+    atomic_set(&co->scheduled, __func__);
+
     QSLIST_INSERT_HEAD_ATOMIC(&ctx->scheduled_coroutines,
                               co, co_scheduled_next);
     qemu_bh_schedule(ctx->co_schedule_bh);
diff --git a/util/qemu-coroutine-sleep.c b/util/qemu-coroutine-sleep.c
index 9c56550..38dc4c8 100644
--- a/util/qemu-coroutine-sleep.c
+++ b/util/qemu-coroutine-sleep.c
@@ -13,6 +13,7 @@ 
 
 #include "qemu/osdep.h"
 #include "qemu/coroutine.h"
+#include "qemu/coroutine_int.h"
 #include "qemu/timer.h"
 #include "block/aio.h"
 
@@ -25,6 +26,7 @@  static void co_sleep_cb(void *opaque)
 {
     CoSleepCB *sleep_cb = opaque;
 
+    atomic_set(&sleep_cb->co->scheduled, NULL);
     aio_co_wake(sleep_cb->co);
 }
 
@@ -34,6 +36,15 @@  void coroutine_fn co_aio_sleep_ns(AioContext *ctx, QEMUClockType type,
     CoSleepCB sleep_cb = {
         .co = qemu_coroutine_self(),
     };
+    const char *scheduled = atomic_read(&sleep_cb.co->scheduled);
+
+    if (scheduled) {
+        fprintf(stderr,
+                "%s: Co-routine was already scheduled in '%s'\n",
+                __func__, scheduled);
+        abort();
+    }
+    atomic_set(&sleep_cb.co->scheduled, __func__);
     sleep_cb.ts = aio_timer_new(ctx, type, SCALE_NS, co_sleep_cb, &sleep_cb);
     timer_mod(sleep_cb.ts, qemu_clock_get_ns(type) + ns);
     qemu_coroutine_yield();
diff --git a/util/qemu-coroutine.c b/util/qemu-coroutine.c
index d6095c1..fbfd0ad 100644
--- a/util/qemu-coroutine.c
+++ b/util/qemu-coroutine.c
@@ -106,9 +106,20 @@  void qemu_aio_coroutine_enter(AioContext *ctx, Coroutine *co)
 {
     Coroutine *self = qemu_coroutine_self();
     CoroutineAction ret;
+    const char *scheduled = atomic_read(&co->scheduled);
 
     trace_qemu_aio_coroutine_enter(ctx, self, co, co->entry_arg);
 
+    /* if the Coroutine has already been scheduled, entering it again will
+     * cause us to enter it twice, potentially even after the coroutine has
+     * been deleted */
+    if (scheduled) {
+        fprintf(stderr,
+                "%s: Co-routine was already scheduled in '%s'\n",
+                __func__, scheduled);
+        abort();
+    }
+
     if (co->caller) {
         fprintf(stderr, "Co-routine re-entered recursively\n");
         abort();