Message ID | 1459519058-29864-1-git-send-email-famz@redhat.com (mailing list archive) |
---|---|
State | New, archived |
On 01/04/2016 15:57, Fam Zheng wrote: > Using the nested aio_poll() in coroutine is a bad idea. This patch > replaces the aio_poll loop in bdrv_drain with a BH, if called in > coroutine. > > For example, the bdrv_drain() in mirror.c can hang when a guest issued > request is pending on it in qemu_co_mutex_lock(). > > Mirror coroutine in this case has just finished a request, and the block > job is about to complete. It calls bdrv_drain() which waits for the > other coroutine to complete. The other coroutine is a scsi-disk request. > The deadlock happens when the latter is in turn pending on the former to > yield/terminate, in qemu_co_mutex_lock(). The state flow is as below > (assuming a qcow2 image): > > mirror coroutine scsi-disk coroutine > ------------------------------------------------------------- > do last write > > qcow2:qemu_co_mutex_lock() > ... > scsi disk read > > tracked request begin > > qcow2:qemu_co_mutex_lock.enter > > qcow2:qemu_co_mutex_unlock() > > bdrv_drain > while (has tracked request) > aio_poll() > > In the scsi-disk coroutine, the qemu_co_mutex_lock() will never return > because the mirror coroutine is blocked in the aio_poll(blocking=true). > > With this patch, the added qemu_coroutine_yield() allows the scsi-disk > coroutine to make progress as expected: > > mirror coroutine scsi-disk coroutine > ------------------------------------------------------------- > do last write > > qcow2:qemu_co_mutex_lock() > ... > scsi disk read > > tracked request begin > > qcow2:qemu_co_mutex_lock.enter > > qcow2:qemu_co_mutex_unlock() > > bdrv_drain.enter >> schedule BH >> qemu_coroutine_yield() >> qcow2:qemu_co_mutex_lock.return >> ... > tracked request end > ... > (resumed from BH callback) > bdrv_drain.return > ... > > Reported-by: Laurent Vivier <lvivier@redhat.com> > Suggested-by: Paolo Bonzini <pbonzini@redhat.com> > Signed-off-by: Fam Zheng <famz@redhat.com> Tested-by: Laurent Vivier <lvivier@redhat.com>
On Fri, Apr 01, 2016 at 09:57:38PM +0800, Fam Zheng wrote: > Using the nested aio_poll() in coroutine is a bad idea. This patch > replaces the aio_poll loop in bdrv_drain with a BH, if called in > coroutine. > > For example, the bdrv_drain() in mirror.c can hang when a guest issued > request is pending on it in qemu_co_mutex_lock(). > > Mirror coroutine in this case has just finished a request, and the block > job is about to complete. It calls bdrv_drain() which waits for the > other coroutine to complete. The other coroutine is a scsi-disk request. > The deadlock happens when the latter is in turn pending on the former to > yield/terminate, in qemu_co_mutex_lock(). The state flow is as below > (assuming a qcow2 image): > > mirror coroutine scsi-disk coroutine > ------------------------------------------------------------- > do last write > > qcow2:qemu_co_mutex_lock() > ... > scsi disk read > > tracked request begin > > qcow2:qemu_co_mutex_lock.enter > > qcow2:qemu_co_mutex_unlock() > > bdrv_drain > while (has tracked request) > aio_poll() > > In the scsi-disk coroutine, the qemu_co_mutex_lock() will never return > because the mirror coroutine is blocked in the aio_poll(blocking=true). > > With this patch, the added qemu_coroutine_yield() allows the scsi-disk > coroutine to make progress as expected: > > mirror coroutine scsi-disk coroutine > ------------------------------------------------------------- > do last write > > qcow2:qemu_co_mutex_lock() > ... > scsi disk read > > tracked request begin > > qcow2:qemu_co_mutex_lock.enter > > qcow2:qemu_co_mutex_unlock() > > bdrv_drain.enter > > schedule BH > > qemu_coroutine_yield() > > qcow2:qemu_co_mutex_lock.return > > ... > tracked request end > ... > (resumed from BH callback) > bdrv_drain.return > ... > > Reported-by: Laurent Vivier <lvivier@redhat.com> > Suggested-by: Paolo Bonzini <pbonzini@redhat.com> > Signed-off-by: Fam Zheng <famz@redhat.com> > > --- > > v2: Call qemu_bh_delete() in BH callback. [Paolo] > Change the loop to an assertion. [Paolo] > Elaborate a bit about the fix in commit log. [Paolo] > --- > block/io.c | 41 +++++++++++++++++++++++++++++++++++++++++ > 1 file changed, 41 insertions(+) > > diff --git a/block/io.c b/block/io.c > index c4869b9..ddcfb4c 100644 > --- a/block/io.c > +++ b/block/io.c > @@ -253,6 +253,43 @@ static void bdrv_drain_recurse(BlockDriverState *bs) > } > } > > +typedef struct { > + Coroutine *co; > + BlockDriverState *bs; > + QEMUBH *bh; > + bool done; > +} BdrvCoDrainData; > + > +static void bdrv_co_drain_bh_cb(void *opaque) > +{ > + BdrvCoDrainData *data = opaque; > + Coroutine *co = data->co; > + > + qemu_bh_delete(data->bh); > + bdrv_drain(data->bs); > + data->done = true; > + qemu_coroutine_enter(co, NULL); > +} > + > +static void coroutine_fn bdrv_co_drain(BlockDriverState *bs) > +{ Please document why a BH is needed: /* Calling bdrv_drain() from a BH ensures the * current coroutine yields and other coroutines run if they were * queued from qemu_co_queue_run_restart(). */ > + BdrvCoDrainData data; > + > + assert(qemu_in_coroutine()); > + data = (BdrvCoDrainData) { > + .co = qemu_coroutine_self(), > + .bs = bs, > + .done = false, > + .bh = aio_bh_new(bdrv_get_aio_context(bs), bdrv_co_drain_bh_cb, &data), > + }; > + qemu_bh_schedule(data.bh); > + > + qemu_coroutine_yield(); > + /* If we are resumed from some other event (such as an aio completion or a > + * timer callback), it is a bug in the caller that should be fixed. 
*/ > + assert(data.done); > +} > + > /* > * Wait for pending requests to complete on a single BlockDriverState subtree, > * and suspend block driver's internal I/O until next request arrives. > @@ -269,6 +306,10 @@ void bdrv_drain(BlockDriverState *bs) > bool busy = true; > > bdrv_drain_recurse(bs); > + if (qemu_in_coroutine()) { > + bdrv_co_drain(bs); > + return; > + } > while (busy) { > /* Keep iterating */ > bdrv_flush_io_queue(bs); > -- > 2.7.4 block/mirror.c should call bdrv_co_drain() explicitly and bdrv_drain() should assert(!qemu_in_coroutine()). The reason why existing bdrv_read() and friends detect coroutine context at runtime is because it is meant for legacy code that runs in both coroutine and non-coroutine contexts. Modern coroutine code should use the coroutine interfaces explicitly instead. Stefan
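[Editor's note] For readers unfamiliar with the legacy pattern Stefan contrasts this against, the sketch below shows, in simplified form, how bdrv_read() and friends in block/io.c of this era dispatch on coroutine context at runtime. It is an editorial illustration abbreviated from memory, not part of the patch; the real helpers are bdrv_rw_co()/bdrv_rw_co_entry(), and their exact fields and signatures may differ.

```c
#include "qemu/osdep.h"
#include "block/block.h"
#include "qemu/coroutine.h"

#define NOT_DONE 0x7fffffff   /* sentinel while the coroutine is still running */

typedef struct RwCo {
    BlockDriverState *bs;
    int64_t sector_num;
    int nb_sectors;
    QEMUIOVector *qiov;
    bool is_write;
    int ret;
} RwCo;

static void coroutine_fn bdrv_rw_co_entry(void *opaque)
{
    RwCo *rwco = opaque;

    if (rwco->is_write) {
        rwco->ret = bdrv_co_writev(rwco->bs, rwco->sector_num,
                                   rwco->nb_sectors, rwco->qiov);
    } else {
        rwco->ret = bdrv_co_readv(rwco->bs, rwco->sector_num,
                                  rwco->nb_sectors, rwco->qiov);
    }
}

/* Legacy entry point usable from both coroutine and non-coroutine context:
 * inside a coroutine it simply runs the body; otherwise it spawns a
 * coroutine and spins in aio_poll() until the request completes. */
static int bdrv_rw_co(BlockDriverState *bs, RwCo *rwco)
{
    rwco->ret = NOT_DONE;

    if (qemu_in_coroutine()) {
        bdrv_rw_co_entry(rwco);
    } else {
        AioContext *aio_context = bdrv_get_aio_context(bs);
        Coroutine *co = qemu_coroutine_create(bdrv_rw_co_entry);

        qemu_coroutine_enter(co, rwco);
        while (rwco->ret == NOT_DONE) {
            aio_poll(aio_context, true);
        }
    }
    return rwco->ret;
}
```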
On 04/04/2016 13:57, Stefan Hajnoczi wrote: > On Fri, Apr 01, 2016 at 09:57:38PM +0800, Fam Zheng wrote: >> Using the nested aio_poll() in coroutine is a bad idea. This patch >> replaces the aio_poll loop in bdrv_drain with a BH, if called in >> coroutine. >> >> For example, the bdrv_drain() in mirror.c can hang when a guest issued >> request is pending on it in qemu_co_mutex_lock(). >> >> Mirror coroutine in this case has just finished a request, and the block >> job is about to complete. It calls bdrv_drain() which waits for the >> other coroutine to complete. The other coroutine is a scsi-disk request. >> The deadlock happens when the latter is in turn pending on the former to >> yield/terminate, in qemu_co_mutex_lock(). The state flow is as below >> (assuming a qcow2 image): >> >> mirror coroutine scsi-disk coroutine >> ------------------------------------------------------------- >> do last write >> >> qcow2:qemu_co_mutex_lock() >> ... >> scsi disk read >> >> tracked request begin >> >> qcow2:qemu_co_mutex_lock.enter >> >> qcow2:qemu_co_mutex_unlock() >> >> bdrv_drain >> while (has tracked request) >> aio_poll() >> >> In the scsi-disk coroutine, the qemu_co_mutex_lock() will never return >> because the mirror coroutine is blocked in the aio_poll(blocking=true). >> >> With this patch, the added qemu_coroutine_yield() allows the scsi-disk >> coroutine to make progress as expected: >> >> mirror coroutine scsi-disk coroutine >> ------------------------------------------------------------- >> do last write >> >> qcow2:qemu_co_mutex_lock() >> ... >> scsi disk read >> >> tracked request begin >> >> qcow2:qemu_co_mutex_lock.enter >> >> qcow2:qemu_co_mutex_unlock() >> >> bdrv_drain.enter >>> schedule BH >>> qemu_coroutine_yield() >>> qcow2:qemu_co_mutex_lock.return >>> ... >> tracked request end >> ... >> (resumed from BH callback) >> bdrv_drain.return >> ... >> >> Reported-by: Laurent Vivier <lvivier@redhat.com> >> Suggested-by: Paolo Bonzini <pbonzini@redhat.com> >> Signed-off-by: Fam Zheng <famz@redhat.com> >> >> --- >> >> v2: Call qemu_bh_delete() in BH callback. [Paolo] >> Change the loop to an assertion. [Paolo] >> Elaborate a bit about the fix in commit log. [Paolo] >> --- >> block/io.c | 41 +++++++++++++++++++++++++++++++++++++++++ >> 1 file changed, 41 insertions(+) >> >> diff --git a/block/io.c b/block/io.c >> index c4869b9..ddcfb4c 100644 >> --- a/block/io.c >> +++ b/block/io.c >> @@ -253,6 +253,43 @@ static void bdrv_drain_recurse(BlockDriverState *bs) >> } >> } >> >> +typedef struct { >> + Coroutine *co; >> + BlockDriverState *bs; >> + QEMUBH *bh; >> + bool done; >> +} BdrvCoDrainData; >> + >> +static void bdrv_co_drain_bh_cb(void *opaque) >> +{ >> + BdrvCoDrainData *data = opaque; >> + Coroutine *co = data->co; >> + >> + qemu_bh_delete(data->bh); >> + bdrv_drain(data->bs); >> + data->done = true; >> + qemu_coroutine_enter(co, NULL); >> +} >> + >> +static void coroutine_fn bdrv_co_drain(BlockDriverState *bs) >> +{ > > Please document why a BH is needed: > > /* Calling bdrv_drain() from a BH ensures the > * current coroutine yields and other coroutines run if they were > * queued from qemu_co_queue_run_restart(). 
> */ > >> + BdrvCoDrainData data; >> + >> + assert(qemu_in_coroutine()); >> + data = (BdrvCoDrainData) { >> + .co = qemu_coroutine_self(), >> + .bs = bs, >> + .done = false, >> + .bh = aio_bh_new(bdrv_get_aio_context(bs), bdrv_co_drain_bh_cb, &data), >> + }; >> + qemu_bh_schedule(data.bh); >> + >> + qemu_coroutine_yield(); >> + /* If we are resumed from some other event (such as an aio completion or a >> + * timer callback), it is a bug in the caller that should be fixed. */ >> + assert(data.done); >> +} >> + >> /* >> * Wait for pending requests to complete on a single BlockDriverState subtree, >> * and suspend block driver's internal I/O until next request arrives. >> @@ -269,6 +306,10 @@ void bdrv_drain(BlockDriverState *bs) >> bool busy = true; >> >> bdrv_drain_recurse(bs); >> + if (qemu_in_coroutine()) { >> + bdrv_co_drain(bs); >> + return; >> + } >> while (busy) { >> /* Keep iterating */ >> bdrv_flush_io_queue(bs); >> -- >> 2.7.4 > > block/mirror.c should call bdrv_co_drain() explicitly and bdrv_drain() > should assert(!qemu_in_coroutine()). > > The reason why existing bdrv_read() and friends detect coroutine context > at runtime is because it is meant for legacy code that runs in both > coroutine and non-coroutine contexts. > > Modern coroutine code coroutine code the coroutine interfaces explicitly > instead. For what it's worth, I suspect Fam's patch removes the need for http://article.gmane.org/gmane.comp.emulators.qemu/401375. That's a nice bonus. :) Paolo
On Mon, 04/04 12:57, Stefan Hajnoczi wrote: > On Fri, Apr 01, 2016 at 09:57:38PM +0800, Fam Zheng wrote: > > Using the nested aio_poll() in coroutine is a bad idea. This patch > > replaces the aio_poll loop in bdrv_drain with a BH, if called in > > coroutine. > > > > For example, the bdrv_drain() in mirror.c can hang when a guest issued > > request is pending on it in qemu_co_mutex_lock(). > > > > Mirror coroutine in this case has just finished a request, and the block > > job is about to complete. It calls bdrv_drain() which waits for the > > other coroutine to complete. The other coroutine is a scsi-disk request. > > The deadlock happens when the latter is in turn pending on the former to > > yield/terminate, in qemu_co_mutex_lock(). The state flow is as below > > (assuming a qcow2 image): > > > > mirror coroutine scsi-disk coroutine > > ------------------------------------------------------------- > > do last write > > > > qcow2:qemu_co_mutex_lock() > > ... > > scsi disk read > > > > tracked request begin > > > > qcow2:qemu_co_mutex_lock.enter > > > > qcow2:qemu_co_mutex_unlock() > > > > bdrv_drain > > while (has tracked request) > > aio_poll() > > > > In the scsi-disk coroutine, the qemu_co_mutex_lock() will never return > > because the mirror coroutine is blocked in the aio_poll(blocking=true). > > > > With this patch, the added qemu_coroutine_yield() allows the scsi-disk > > coroutine to make progress as expected: > > > > mirror coroutine scsi-disk coroutine > > ------------------------------------------------------------- > > do last write > > > > qcow2:qemu_co_mutex_lock() > > ... > > scsi disk read > > > > tracked request begin > > > > qcow2:qemu_co_mutex_lock.enter > > > > qcow2:qemu_co_mutex_unlock() > > > > bdrv_drain.enter > > > schedule BH > > > qemu_coroutine_yield() > > > qcow2:qemu_co_mutex_lock.return > > > ... > > tracked request end > > ... > > (resumed from BH callback) > > bdrv_drain.return > > ... > > > > Reported-by: Laurent Vivier <lvivier@redhat.com> > > Suggested-by: Paolo Bonzini <pbonzini@redhat.com> > > Signed-off-by: Fam Zheng <famz@redhat.com> > > > > --- > > > > v2: Call qemu_bh_delete() in BH callback. [Paolo] > > Change the loop to an assertion. [Paolo] > > Elaborate a bit about the fix in commit log. [Paolo] > > --- > > block/io.c | 41 +++++++++++++++++++++++++++++++++++++++++ > > 1 file changed, 41 insertions(+) > > > > diff --git a/block/io.c b/block/io.c > > index c4869b9..ddcfb4c 100644 > > --- a/block/io.c > > +++ b/block/io.c > > @@ -253,6 +253,43 @@ static void bdrv_drain_recurse(BlockDriverState *bs) > > } > > } > > > > +typedef struct { > > + Coroutine *co; > > + BlockDriverState *bs; > > + QEMUBH *bh; > > + bool done; > > +} BdrvCoDrainData; > > + > > +static void bdrv_co_drain_bh_cb(void *opaque) > > +{ > > + BdrvCoDrainData *data = opaque; > > + Coroutine *co = data->co; > > + > > + qemu_bh_delete(data->bh); > > + bdrv_drain(data->bs); > > + data->done = true; > > + qemu_coroutine_enter(co, NULL); > > +} > > + > > +static void coroutine_fn bdrv_co_drain(BlockDriverState *bs) > > +{ > > Please document why a BH is needed: > > /* Calling bdrv_drain() from a BH ensures the > * current coroutine yields and other coroutines run if they were > * queued from qemu_co_queue_run_restart(). > */ OK. 
> > > + BdrvCoDrainData data; > > + > > + assert(qemu_in_coroutine()); > > + data = (BdrvCoDrainData) { > > + .co = qemu_coroutine_self(), > > + .bs = bs, > > + .done = false, > > + .bh = aio_bh_new(bdrv_get_aio_context(bs), bdrv_co_drain_bh_cb, &data), > > + }; > > + qemu_bh_schedule(data.bh); > > + > > + qemu_coroutine_yield(); > > + /* If we are resumed from some other event (such as an aio completion or a > > + * timer callback), it is a bug in the caller that should be fixed. */ > > + assert(data.done); > > +} > > + > > /* > > * Wait for pending requests to complete on a single BlockDriverState subtree, > > * and suspend block driver's internal I/O until next request arrives. > > @@ -269,6 +306,10 @@ void bdrv_drain(BlockDriverState *bs) > > bool busy = true; > > > > bdrv_drain_recurse(bs); > > + if (qemu_in_coroutine()) { > > + bdrv_co_drain(bs); > > + return; > > + } > > while (busy) { > > /* Keep iterating */ > > bdrv_flush_io_queue(bs); > > -- > > 2.7.4 > > block/mirror.c should call bdrv_co_drain() explicitly and bdrv_drain() > should assert(!qemu_in_coroutine()). > > The reason why existing bdrv_read() and friends detect coroutine context > at runtime is because it is meant for legacy code that runs in both > coroutine and non-coroutine contexts. blk_unref/bdrv_unref are called in coroutine (.bdrv_create implementations), and this doesn't just work with the assertion. Should I clean up this "legacy" code first, i.e. move bdrv_unref calls to BHs in the callers and assert(!qemu_in_coroutine()) there too? I didn't think this because it complicates the code somehow. > > Modern coroutine code coroutine code the coroutine interfaces explicitly > instead.
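[Editor's note] To make the rejected alternative concrete: moving the bdrv_unref() calls out of the .bdrv_create coroutines would mean each such caller defers the unref to a bottom half, roughly as sketched below. This is a hypothetical illustration only; bdrv_schedule_unref() is an invented name, not a QEMU API, and as Fam says it adds boilerplate to every caller.

```c
#include "qemu/osdep.h"
#include "block/block.h"

typedef struct UnrefBH {
    BlockDriverState *bs;
    QEMUBH *bh;
} UnrefBH;

static void bdrv_unref_bh_cb(void *opaque)
{
    UnrefBH *u = opaque;

    qemu_bh_delete(u->bh);
    bdrv_unref(u->bs);      /* reference dropped outside coroutine context */
    g_free(u);
}

/* Hypothetical helper: schedule the unref in a bottom half so that any final
 * bdrv_drain()/bdrv_close() it triggers never runs inside a coroutine. */
static void bdrv_schedule_unref(BlockDriverState *bs)
{
    UnrefBH *u = g_new0(UnrefBH, 1);

    u->bs = bs;
    u->bh = aio_bh_new(bdrv_get_aio_context(bs), bdrv_unref_bh_cb, u);
    qemu_bh_schedule(u->bh);
}
```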
On Tue, Apr 05, 2016 at 09:27:39AM +0800, Fam Zheng wrote: > On Mon, 04/04 12:57, Stefan Hajnoczi wrote: > > On Fri, Apr 01, 2016 at 09:57:38PM +0800, Fam Zheng wrote: > > > + BdrvCoDrainData data; > > > + > > > + assert(qemu_in_coroutine()); > > > + data = (BdrvCoDrainData) { > > > + .co = qemu_coroutine_self(), > > > + .bs = bs, > > > + .done = false, > > > + .bh = aio_bh_new(bdrv_get_aio_context(bs), bdrv_co_drain_bh_cb, &data), > > > + }; > > > + qemu_bh_schedule(data.bh); > > > + > > > + qemu_coroutine_yield(); > > > + /* If we are resumed from some other event (such as an aio completion or a > > > + * timer callback), it is a bug in the caller that should be fixed. */ > > > + assert(data.done); > > > +} > > > + > > > /* > > > * Wait for pending requests to complete on a single BlockDriverState subtree, > > > * and suspend block driver's internal I/O until next request arrives. > > > @@ -269,6 +306,10 @@ void bdrv_drain(BlockDriverState *bs) > > > bool busy = true; > > > > > > bdrv_drain_recurse(bs); > > > + if (qemu_in_coroutine()) { > > > + bdrv_co_drain(bs); > > > + return; > > > + } > > > while (busy) { > > > /* Keep iterating */ > > > bdrv_flush_io_queue(bs); > > > -- > > > 2.7.4 > > > > block/mirror.c should call bdrv_co_drain() explicitly and bdrv_drain() > > should assert(!qemu_in_coroutine()). > > > > The reason why existing bdrv_read() and friends detect coroutine context > > at runtime is because it is meant for legacy code that runs in both > > coroutine and non-coroutine contexts. > > blk_unref/bdrv_unref are called in coroutine (.bdrv_create implementations), > and this doesn't just work with the assertion. Should I clean up this "legacy" > code first, i.e. move bdrv_unref calls to BHs in the callers and > assert(!qemu_in_coroutine()) there too? I didn't think this because it > complicates the code somehow. This is a messy problem. In general I don't like introducing yields into non-coroutine_fn functions because it can lead to bugs when the caller didn't expect a yield point. For example, I myself wouldn't have expected bdrv_unref() to be a yield point. So maybe coroutine code I've written would be vulnerable to interference (I won't call it a race condition) from another coroutine across the bdrv_unref() call. This could mean that another coroutine now sees intermediate state that would never be visible without the new yield point. I think attempting to invoke qemu_co_queue_run_restart() directly instead of scheduling a BH and yielding does not improve the situation. It's also a layering violation since qemu_co_queue_run_restart() is just meant for the core coroutine code and isn't a public interface. Anyway, let's consider bdrv_drain() legacy code that can call if (qemu_in_coroutine()) but please make bdrv_co_drain() public so block/mirror.c can at least call it directly. Stefan
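[Editor's note] A minimal sketch of the direction agreed here, assuming the prototype is exported through include/block/block.h and the call site ends up in block/mirror.c; mirror_drain_source() is a made-up helper standing in for the drain done near the end of mirror_run().

```c
#include "qemu/osdep.h"
#include "block/block.h"

/* Would move to include/block/block.h once bdrv_co_drain() loses 'static': */
void coroutine_fn bdrv_co_drain(BlockDriverState *bs);

/* Hypothetical coroutine_fn call site in block/mirror.c: call the coroutine
 * variant directly instead of relying on bdrv_drain()'s runtime dispatch. */
static void coroutine_fn mirror_drain_source(BlockDriverState *bs)
{
    assert(qemu_in_coroutine());
    bdrv_co_drain(bs);
}
```

With that in place, bdrv_drain() keeps its if (qemu_in_coroutine()) branch only for legacy callers such as the bdrv_unref() paths discussed above.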
On Tue, 04/05 10:39, Stefan Hajnoczi wrote: > > > block/mirror.c should call bdrv_co_drain() explicitly and bdrv_drain() > > > should assert(!qemu_in_coroutine()). > > > > > > The reason why existing bdrv_read() and friends detect coroutine context > > > at runtime is because it is meant for legacy code that runs in both > > > coroutine and non-coroutine contexts. > > > > blk_unref/bdrv_unref are called in coroutine (.bdrv_create implementations), > > and this doesn't just work with the assertion. Should I clean up this "legacy" > > code first, i.e. move bdrv_unref calls to BHs in the callers and > > assert(!qemu_in_coroutine()) there too? I didn't think this because it > > complicates the code somehow. > > This is a messy problem. > > In general I don't like introducing yields into non-coroutine_fn > functions because it can lead to bugs when the caller didn't expect a > yield point. > > For example, I myself wouldn't have expected bdrv_unref() to be a yield > point. So maybe coroutine code I've written would be vulnerable to > interference (I won't call it a race condition) from another coroutine > across the bdrv_unref() call. This could mean that another coroutine > now sees intermediate state that would never be visible without the new > yield point. > > I think attempting to invoke qemu_co_queue_run_restart() directly > instead of scheduling a BH and yielding does not improve the situation. > It's also a layering violation since qemu_co_queue_run_restart() is just > meant for the core coroutine code and isn't a public interface. > > Anyway, let's consider bdrv_drain() legacy code that can call if > (qemu_in_coroutine()) but please make bdrv_co_drain() public so > block/mirror.c can at least call it directly. OK, will do. Fam
On 05/04/2016 13:15, Fam Zheng wrote: > > Anyway, let's consider bdrv_drain() legacy code that can call if > > (qemu_in_coroutine()) but please make bdrv_co_drain() public so > > block/mirror.c can at least call it directly. > OK, will do. Please create a bdrv_co_drained_begin() function too, even if it's not used yet. Paolo
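[Editor's note] For reference, a coroutine-aware counterpart to the existing bdrv_drained_begin()/bdrv_drained_end() pair could be declared as below. This is only an illustration of Paolo's request, not the eventual QEMU interface.

```c
/* Quiesce bs from coroutine context; new external requests stay blocked until
 * the matching bdrv_drained_end().  Sketch only: the final signature, and
 * whether a coroutine variant of _end is needed, are not decided here. */
void coroutine_fn bdrv_co_drained_begin(BlockDriverState *bs);
```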
```diff
diff --git a/block/io.c b/block/io.c
index c4869b9..ddcfb4c 100644
--- a/block/io.c
+++ b/block/io.c
@@ -253,6 +253,43 @@ static void bdrv_drain_recurse(BlockDriverState *bs)
     }
 }
 
+typedef struct {
+    Coroutine *co;
+    BlockDriverState *bs;
+    QEMUBH *bh;
+    bool done;
+} BdrvCoDrainData;
+
+static void bdrv_co_drain_bh_cb(void *opaque)
+{
+    BdrvCoDrainData *data = opaque;
+    Coroutine *co = data->co;
+
+    qemu_bh_delete(data->bh);
+    bdrv_drain(data->bs);
+    data->done = true;
+    qemu_coroutine_enter(co, NULL);
+}
+
+static void coroutine_fn bdrv_co_drain(BlockDriverState *bs)
+{
+    BdrvCoDrainData data;
+
+    assert(qemu_in_coroutine());
+    data = (BdrvCoDrainData) {
+        .co = qemu_coroutine_self(),
+        .bs = bs,
+        .done = false,
+        .bh = aio_bh_new(bdrv_get_aio_context(bs), bdrv_co_drain_bh_cb, &data),
+    };
+    qemu_bh_schedule(data.bh);
+
+    qemu_coroutine_yield();
+    /* If we are resumed from some other event (such as an aio completion or a
+     * timer callback), it is a bug in the caller that should be fixed. */
+    assert(data.done);
+}
+
 /*
  * Wait for pending requests to complete on a single BlockDriverState subtree,
  * and suspend block driver's internal I/O until next request arrives.
@@ -269,6 +306,10 @@ void bdrv_drain(BlockDriverState *bs)
     bool busy = true;
 
     bdrv_drain_recurse(bs);
+    if (qemu_in_coroutine()) {
+        bdrv_co_drain(bs);
+        return;
+    }
     while (busy) {
         /* Keep iterating */
         bdrv_flush_io_queue(bs);
```
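[Editor's note] Stefan's review comment was about documentation rather than behaviour; a v3 of bdrv_co_drain() folding in the requested comment might look like the sketch below. The comment wording is Stefan's; its exact placement is an assumption.

```c
static void coroutine_fn bdrv_co_drain(BlockDriverState *bs)
{
    BdrvCoDrainData data;

    /* Calling bdrv_drain() from a BH ensures the current coroutine yields and
     * other coroutines run if they were queued from
     * qemu_co_queue_run_restart(). */
    assert(qemu_in_coroutine());
    data = (BdrvCoDrainData) {
        .co = qemu_coroutine_self(),
        .bs = bs,
        .done = false,
        .bh = aio_bh_new(bdrv_get_aio_context(bs), bdrv_co_drain_bh_cb, &data),
    };
    qemu_bh_schedule(data.bh);

    qemu_coroutine_yield();
    /* If we are resumed from some other event (such as an aio completion or a
     * timer callback), it is a bug in the caller that should be fixed. */
    assert(data.done);
}
```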