Message ID: 1f0fd95c2096688add2c7b3cfcd7016756ef19fb.1511230683.git.jcody@redhat.com (mailing list archive)
State: New, archived
On Mon, Nov 20, 2017 at 09:23:24PM -0500, Jeff Cody wrote:
> @@ -438,6 +439,16 @@ fail:
>  void aio_co_schedule(AioContext *ctx, Coroutine *co)
>  {
>      trace_aio_co_schedule(ctx, co);
> +    const char *scheduled = atomic_read(&co->scheduled);
> +
> +    if (scheduled) {
> +        fprintf(stderr,
> +                "%s: Co-routine was already scheduled in '%s'\n",
> +                __func__, scheduled);
> +        abort();
> +    }
> +    atomic_set(&co->scheduled, __func__);

According to docs/devel/atomics.txt atomic_set()/atomic_read() are
weakly ordered.  They require memory barriers to provide guarantees
about ordering.  Your patch doesn't include barriers or comments about
where the implicit barriers are.

The docs recommend using the following instead of
atomic_read()/atomic_set() to get ordering:

  typeof(*ptr) atomic_mb_read(ptr)
  void atomic_mb_set(ptr, val)
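[Editor's note: Stefan's distinction between the two flavours of atomic helper can be illustrated with the GCC/Clang __atomic builtins. The helpers below are hypothetical stand-ins written for this thread, not QEMU's actual macro definitions; they only show the difference in memory ordering between the relaxed accessors and the barrier-providing ones.]

```c
#include <stddef.h>

/* Hypothetical stand-ins for QEMU's atomic helpers, expressed with
 * GCC/Clang __atomic builtins: atomic_read()/atomic_set() are relaxed
 * (weakly ordered, no barriers), while the recommended
 * atomic_mb_read()/atomic_mb_set() are sequentially consistent. */

static const char *relaxed_read(const char **p)
{
    return __atomic_load_n(p, __ATOMIC_RELAXED);   /* no ordering */
}

static void relaxed_set(const char **p, const char *v)
{
    __atomic_store_n(p, v, __ATOMIC_RELAXED);      /* no ordering */
}

static const char *mb_read(const char **p)
{
    return __atomic_load_n(p, __ATOMIC_SEQ_CST);   /* full barrier */
}

static void mb_set(const char **p, const char *v)
{
    __atomic_store_n(p, v, __ATOMIC_SEQ_CST);      /* full barrier */
}
```

Single-threaded, all four behave identically; the ordering difference only matters when another thread observes `co->scheduled` concurrently, which is exactly the aio_co_schedule() situation.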
On 11/20/2017 08:23 PM, Jeff Cody wrote:
> The previous patch fixed a race condition, in which there were
> coroutines being executing doubly, or after coroutine deletion.
>
> We can detect common scenarios when this happens, and print an error
> message and abort before we corrupt memory / data, or segfault.
>
> This patch will abort if an attempt to enter a coroutine is made while
> it is currently pending execution, either in a specific AioContext bh,
> or pending execution via a timer.  It will also abort if a coroutine
> is scheduled, before a prior scheduled run has occured.

s/occured/occurred/

> We cannot rely on the existing co->caller check for recursive re-entry
> to catch this, as the coroutine may run and exit with
> COROUTINE_TERMINATE before the scheduled coroutine executes.
>
> (This is the scenario that was occuring and fixed in the previous

s/occuring/occurring/

> patch).
>
> Signed-off-by: Jeff Cody <jcody@redhat.com>
> ---
On 21/11/2017 11:59, Stefan Hajnoczi wrote:
> On Mon, Nov 20, 2017 at 09:23:24PM -0500, Jeff Cody wrote:
>> @@ -438,6 +439,16 @@ fail:
>>  void aio_co_schedule(AioContext *ctx, Coroutine *co)
>>  {
>>      trace_aio_co_schedule(ctx, co);
>> +    const char *scheduled = atomic_read(&co->scheduled);
>> +
>> +    if (scheduled) {
>> +        fprintf(stderr,
>> +                "%s: Co-routine was already scheduled in '%s'\n",
>> +                __func__, scheduled);
>> +        abort();
>> +    }
>> +    atomic_set(&co->scheduled, __func__);
>
> According to docs/devel/atomics.txt atomic_set()/atomic_read() are
> weakly ordered.  They require memory barriers to provide guarantees
> about ordering.  Your patch doesn't include barriers or comments about
> where the implicit barriers are.
>
> The docs recommend using the following instead of
> atomic_read()/atomic_set() to get ordering:
>
>   typeof(*ptr) atomic_mb_read(ptr)
>   void atomic_mb_set(ptr, val)

Even with atomic_mb_read/atomic_mb_set we should document what memory
ordering is required for correctness (i.e. what should be ordered against
what, or equivalently what operation has release/acquire semantics).

My reasoning is that the NULL write to co->scheduled should be ordered
before a write of co (anywhere), but the same is true for co->ctx, so the
NULL write is ordered by the smp_wmb in qemu_aio_coroutine_enter.  So
that is fine, but it needs a comment.

Dually, the read should be ordered after a read of co, but that's handled
by the callers via atomic_rcu_read if they need the ordering.  No comment
needed here, I think.

What's left is the __func__ write, where atomic_mb_set should be used
indeed.  Or, perhaps better:

    const char *scheduled = atomic_cmpxchg(&co->scheduled, NULL, __func__);
    if (scheduled) {
        ...
    }

... which also removes the atomic_read.

Thanks,

Paolo
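[Editor's note: Paolo's cmpxchg variant can be sketched outside QEMU with the `__atomic_compare_exchange_n` builtin. `try_schedule()` below is a hypothetical helper written for illustration, not QEMU code; it shows how the check and the claim of `co->scheduled` become a single atomic step instead of a separate read and write.]

```c
#include <stdbool.h>
#include <stddef.h>

/* Sketch of the cmpxchg-based guard: atomically swap NULL -> func in the
 * scheduled slot.  Returns NULL if the slot was free (and is now claimed
 * by func), else the name of the function that already scheduled the
 * coroutine, which the caller can print before aborting. */
static const char *try_schedule(const char **scheduled, const char *func)
{
    const char *expected = NULL;

    if (__atomic_compare_exchange_n(scheduled, &expected, func,
                                    false,
                                    __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST)) {
        return NULL;    /* slot was free; we claimed it */
    }
    return expected;    /* CAS failed; expected now holds the old value */
}
```

The win over read-then-set is that no second scheduler can sneak in between the check and the store: whichever thread's CAS succeeds owns the slot, and the loser gets the winner's name back for the error message.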
Am 21.11.2017 um 03:23 hat Jeff Cody geschrieben:
> The previous patch fixed a race condition, in which there were
> coroutines being executing doubly, or after coroutine deletion.
>
> We can detect common scenarios when this happens, and print an error
> message and abort before we corrupt memory / data, or segfault.
>
> This patch will abort if an attempt to enter a coroutine is made while
> it is currently pending execution, either in a specific AioContext bh,
> or pending execution via a timer.  It will also abort if a coroutine
> is scheduled, before a prior scheduled run has occured.
>
> We cannot rely on the existing co->caller check for recursive re-entry
> to catch this, as the coroutine may run and exit with
> COROUTINE_TERMINATE before the scheduled coroutine executes.
>
> (This is the scenario that was occuring and fixed in the previous
> patch).
>
> Signed-off-by: Jeff Cody <jcody@redhat.com>
> ---
>  include/qemu/coroutine_int.h |  6 ++++++
>  util/async.c                 | 11 +++++++++++
>  util/qemu-coroutine-sleep.c  | 11 +++++++++++
>  util/qemu-coroutine.c        | 11 +++++++++++
>  4 files changed, 39 insertions(+)
>
> diff --git a/include/qemu/coroutine_int.h b/include/qemu/coroutine_int.h
> index cb98892..56e4c48 100644
> --- a/include/qemu/coroutine_int.h
> +++ b/include/qemu/coroutine_int.h
> @@ -53,6 +53,12 @@ struct Coroutine {
>  
>      /* Only used when the coroutine has yielded.  */
>      AioContext *ctx;
> +
> +    /* Used to catch and abort on illegal co-routine entry.
> +     * Will contain the name of the function that had first
> +     * scheduled the coroutine. */
> +    const char *scheduled;

Not sure if it makes any difference in practice, but I just want to
mention that the new field is right after a cacheline boundary, and is
the only field used in qemu_aio_coroutine_enter() that accesses this
second cacheline.

I'm not paying much attention to this kind of thing in most contexts,
but entering a coroutine is a hot path that we want to be fast, so maybe
it's worth having a second look.

Kevin
On 21/11/2017 14:47, Kevin Wolf wrote:
> Am 21.11.2017 um 03:23 hat Jeff Cody geschrieben:
>> The previous patch fixed a race condition, in which there were
>> coroutines being executing doubly, or after coroutine deletion.
>>
>> We can detect common scenarios when this happens, and print an error
>> message and abort before we corrupt memory / data, or segfault.
>>
>> This patch will abort if an attempt to enter a coroutine is made while
>> it is currently pending execution, either in a specific AioContext bh,
>> or pending execution via a timer.  It will also abort if a coroutine
>> is scheduled, before a prior scheduled run has occured.
>>
>> We cannot rely on the existing co->caller check for recursive re-entry
>> to catch this, as the coroutine may run and exit with
>> COROUTINE_TERMINATE before the scheduled coroutine executes.
>>
>> (This is the scenario that was occuring and fixed in the previous
>> patch).
>>
>> Signed-off-by: Jeff Cody <jcody@redhat.com>
>> ---
>>  include/qemu/coroutine_int.h |  6 ++++++
>>  util/async.c                 | 11 +++++++++++
>>  util/qemu-coroutine-sleep.c  | 11 +++++++++++
>>  util/qemu-coroutine.c        | 11 +++++++++++
>>  4 files changed, 39 insertions(+)
>>
>> diff --git a/include/qemu/coroutine_int.h b/include/qemu/coroutine_int.h
>> index cb98892..56e4c48 100644
>> --- a/include/qemu/coroutine_int.h
>> +++ b/include/qemu/coroutine_int.h
>> @@ -53,6 +53,12 @@ struct Coroutine {
>>  
>>      /* Only used when the coroutine has yielded.  */
>>      AioContext *ctx;
>> +
>> +    /* Used to catch and abort on illegal co-routine entry.
>> +     * Will contain the name of the function that had first
>> +     * scheduled the coroutine. */
>> +    const char *scheduled;
>
> Not sure if it makes any difference in practice, but I just want to
> mention that the new field is right after a cacheline boundary, and is
> the only field used in qemu_aio_coroutine_enter() that accesses this
> second cacheline.
>
> I'm not paying much attention to this kind of thing in most contexts,
> but entering a coroutine is a hot path that we want to be fast, so maybe
> it's worth having a second look.

Makes sense!  Since co_queue_wakeup is used on *yield*, maybe the order
should be: ctx, scheduled, co_queue_next, co_queue_wakeup,
co_scheduled_next.

Thanks,

Paolo
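[Editor's note: the cacheline arithmetic behind Kevin's and Paolo's exchange can be checked mechanically with `offsetof()`. The struct below is a simplified stand-in, not the real `struct Coroutine` layout, and the 64-byte line size is an assumption (typical for x86-64); it only demonstrates the kind of check one might add.]

```c
#include <stddef.h>

#define CACHELINE_SIZE 64   /* assumed line size; typical on x86-64 */

/* Simplified stand-in for struct Coroutine (NOT the real layout), to show
 * how offsetof() can verify that the fields touched on the coroutine-enter
 * hot path share a cacheline with the start of the struct. */
struct FakeCoroutine {
    void (*entry)(void *);
    void *entry_arg;
    void *caller;
    char  context[32];       /* stand-in for arch-specific saved state */
    void *ctx;
    const char *scheduled;   /* the new field under discussion */
};

/* 1 if both member offsets fall within the same cacheline, else 0 */
static int same_cacheline(size_t a, size_t b)
{
    return a / CACHELINE_SIZE == b / CACHELINE_SIZE;
}
```

A static assertion built on this (e.g. that `scheduled` and `caller` share a line) would turn the "second look" Kevin asks for into a compile-time guarantee.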
diff --git a/include/qemu/coroutine_int.h b/include/qemu/coroutine_int.h
index cb98892..56e4c48 100644
--- a/include/qemu/coroutine_int.h
+++ b/include/qemu/coroutine_int.h
@@ -53,6 +53,12 @@ struct Coroutine {
 
     /* Only used when the coroutine has yielded.  */
     AioContext *ctx;
+
+    /* Used to catch and abort on illegal co-routine entry.
+     * Will contain the name of the function that had first
+     * scheduled the coroutine. */
+    const char *scheduled;
+
     QSIMPLEQ_ENTRY(Coroutine) co_queue_next;
     QSLIST_ENTRY(Coroutine) co_scheduled_next;
 };
diff --git a/util/async.c b/util/async.c
index 0e1bd87..49174b3 100644
--- a/util/async.c
+++ b/util/async.c
@@ -388,6 +388,7 @@ static void co_schedule_bh_cb(void *opaque)
         QSLIST_REMOVE_HEAD(&straight, co_scheduled_next);
         trace_aio_co_schedule_bh_cb(ctx, co);
         aio_context_acquire(ctx);
+        atomic_set(&co->scheduled, NULL);
         qemu_coroutine_enter(co);
         aio_context_release(ctx);
     }
@@ -438,6 +439,16 @@ fail:
 void aio_co_schedule(AioContext *ctx, Coroutine *co)
 {
     trace_aio_co_schedule(ctx, co);
+    const char *scheduled = atomic_read(&co->scheduled);
+
+    if (scheduled) {
+        fprintf(stderr,
+                "%s: Co-routine was already scheduled in '%s'\n",
+                __func__, scheduled);
+        abort();
+    }
+    atomic_set(&co->scheduled, __func__);
+
     QSLIST_INSERT_HEAD_ATOMIC(&ctx->scheduled_coroutines,
                               co, co_scheduled_next);
     qemu_bh_schedule(ctx->co_schedule_bh);
diff --git a/util/qemu-coroutine-sleep.c b/util/qemu-coroutine-sleep.c
index 9c56550..38dc4c8 100644
--- a/util/qemu-coroutine-sleep.c
+++ b/util/qemu-coroutine-sleep.c
@@ -13,6 +13,7 @@
 
 #include "qemu/osdep.h"
 #include "qemu/coroutine.h"
+#include "qemu/coroutine_int.h"
 #include "qemu/timer.h"
 #include "block/aio.h"
 
@@ -25,6 +26,7 @@ static void co_sleep_cb(void *opaque)
 {
     CoSleepCB *sleep_cb = opaque;
 
+    atomic_set(&sleep_cb->co->scheduled, NULL);
     aio_co_wake(sleep_cb->co);
 }
 
@@ -34,6 +36,15 @@ void coroutine_fn co_aio_sleep_ns(AioContext *ctx, QEMUClockType type,
     CoSleepCB sleep_cb = {
         .co = qemu_coroutine_self(),
     };
+    const char *scheduled = atomic_read(&sleep_cb.co->scheduled);
+
+    if (scheduled) {
+        fprintf(stderr,
+                "%s: Co-routine was already scheduled in '%s'\n",
+                __func__, scheduled);
+        abort();
+    }
+    atomic_set(&sleep_cb.co->scheduled, __func__);
 
     sleep_cb.ts = aio_timer_new(ctx, type, SCALE_NS, co_sleep_cb, &sleep_cb);
     timer_mod(sleep_cb.ts, qemu_clock_get_ns(type) + ns);
     qemu_coroutine_yield();
diff --git a/util/qemu-coroutine.c b/util/qemu-coroutine.c
index d6095c1..fbfd0ad 100644
--- a/util/qemu-coroutine.c
+++ b/util/qemu-coroutine.c
@@ -106,9 +106,20 @@ void qemu_aio_coroutine_enter(AioContext *ctx, Coroutine *co)
 {
     Coroutine *self = qemu_coroutine_self();
     CoroutineAction ret;
+    const char *scheduled = atomic_read(&co->scheduled);
 
     trace_qemu_aio_coroutine_enter(ctx, self, co, co->entry_arg);
 
+    /* if the Coroutine has already been scheduled, entering it again will
+     * cause us to enter it twice, potentially even after the coroutine has
+     * been deleted */
+    if (scheduled) {
+        fprintf(stderr,
+                "%s: Co-routine was already scheduled in '%s'\n",
+                __func__, scheduled);
+        abort();
+    }
+
     if (co->caller) {
         fprintf(stderr, "Co-routine re-entered recursively\n");
         abort();
The previous patch fixed a race condition, in which there were
coroutines being executing doubly, or after coroutine deletion.

We can detect common scenarios when this happens, and print an error
message and abort before we corrupt memory / data, or segfault.

This patch will abort if an attempt to enter a coroutine is made while
it is currently pending execution, either in a specific AioContext bh,
or pending execution via a timer.  It will also abort if a coroutine
is scheduled, before a prior scheduled run has occured.

We cannot rely on the existing co->caller check for recursive re-entry
to catch this, as the coroutine may run and exit with
COROUTINE_TERMINATE before the scheduled coroutine executes.

(This is the scenario that was occuring and fixed in the previous
patch).

Signed-off-by: Jeff Cody <jcody@redhat.com>
---
 include/qemu/coroutine_int.h |  6 ++++++
 util/async.c                 | 11 +++++++++++
 util/qemu-coroutine-sleep.c  | 11 +++++++++++
 util/qemu-coroutine.c        | 11 +++++++++++
 4 files changed, 39 insertions(+)
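[Editor's note: the lifecycle the commit message describes can be modelled in a few lines: the scheduling side stamps `co->scheduled` with the scheduler's name, the bh/timer callback clears it just before entering the coroutine, and a non-NULL value seen at schedule time means an illegal double schedule. The names below mirror the patch, but the functions are simplified stand-ins written for this note, not QEMU code; the real patch aborts where this sketch returns -1.]

```c
#include <stddef.h>

/* Toy model of the patch's guard, outside QEMU. */
struct ToyCoroutine {
    const char *scheduled;   /* who scheduled us; NULL if not pending */
};

/* models aio_co_schedule(): returns 0 on success, -1 on double schedule
 * (where the real code prints the stored name and calls abort()) */
static int toy_schedule(struct ToyCoroutine *co, const char *func)
{
    if (__atomic_load_n(&co->scheduled, __ATOMIC_SEQ_CST)) {
        return -1;                       /* double schedule detected */
    }
    __atomic_store_n(&co->scheduled, func, __ATOMIC_SEQ_CST);
    return 0;
}

/* models co_schedule_bh_cb() / co_sleep_cb(): clear the marker first,
 * then enter the coroutine */
static int toy_enter(struct ToyCoroutine *co)
{
    __atomic_store_n(&co->scheduled, NULL, __ATOMIC_SEQ_CST);
    return 0;                            /* coroutine body would run here */
}
```

Clearing the marker before entry (rather than after) is what lets the coroutine legally reschedule itself from inside its own body.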