[v2] block: Fix bdrv_drain in coroutine

Message ID 1459519058-29864-1-git-send-email-famz@redhat.com (mailing list archive)
State New, archived

Commit Message

Fam Zheng April 1, 2016, 1:57 p.m. UTC
Using nested aio_poll() in a coroutine is a bad idea. This patch
replaces the aio_poll() loop in bdrv_drain() with a BH when it is
called from a coroutine.

For example, the bdrv_drain() in mirror.c can hang when a guest-issued
request is pending on it in qemu_co_mutex_lock().

The mirror coroutine in this case has just finished a request, and the
block job is about to complete. It calls bdrv_drain(), which waits for
the other coroutine to complete. The other coroutine is a scsi-disk
request. The deadlock happens when the latter is in turn waiting for
the former to yield/terminate, in qemu_co_mutex_lock(). The state flow
is as below (assuming a qcow2 image):

  mirror coroutine               scsi-disk coroutine
  -------------------------------------------------------------
  do last write

    qcow2:qemu_co_mutex_lock()
    ...
                                 scsi disk read

                                   tracked request begin

                                   qcow2:qemu_co_mutex_lock.enter

    qcow2:qemu_co_mutex_unlock()

  bdrv_drain
    while (has tracked request)
      aio_poll()

In the scsi-disk coroutine, qemu_co_mutex_lock() never returns because
the mirror coroutine is blocked in aio_poll(blocking=true).
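
For reference, the pre-patch loop looks roughly like the sketch below.
It is reconstructed from the context lines of the diff further down;
the two lines after bdrv_flush_io_queue() are filled in from the
surrounding description and may differ slightly from the exact code in
the tree. The blocking aio_poll() call is where the mirror coroutine
gets stuck:

  /* Simplified sketch of the pre-patch bdrv_drain(): it spins in the
   * caller's context, so a coroutine caller never yields and other
   * coroutines queued on the same AioContext cannot run. */
  void bdrv_drain(BlockDriverState *bs)
  {
      bool busy = true;

      bdrv_drain_recurse(bs);
      while (busy) {
          /* Keep iterating */
          bdrv_flush_io_queue(bs);
          busy = bdrv_requests_pending(bs);                 /* assumed */
          busy |= aio_poll(bdrv_get_aio_context(bs), busy); /* blocks here */
      }
  }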

With this patch, the added qemu_coroutine_yield() allows the scsi-disk
coroutine to make progress as expected:

  mirror coroutine               scsi-disk coroutine
  -------------------------------------------------------------
  do last write

    qcow2:qemu_co_mutex_lock()
    ...
                                 scsi disk read

                                   tracked request begin

                                   qcow2:qemu_co_mutex_lock.enter

    qcow2:qemu_co_mutex_unlock()

  bdrv_drain.enter
>   schedule BH
>   qemu_coroutine_yield()
>                                  qcow2:qemu_co_mutex_lock.return
>                                  ...
                                   tracked request end
    ...
    (resumed from BH callback)
  bdrv_drain.return
  ...

Reported-by: Laurent Vivier <lvivier@redhat.com>
Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Fam Zheng <famz@redhat.com>

---

v2: Call qemu_bh_delete() in BH callback. [Paolo]
    Change the loop to an assertion. [Paolo]
    Elaborate a bit about the fix in commit log. [Paolo]
---
 block/io.c | 41 +++++++++++++++++++++++++++++++++++++++++
 1 file changed, 41 insertions(+)

Comments

Laurent Vivier April 1, 2016, 2:14 p.m. UTC | #1
On 01/04/2016 15:57, Fam Zheng wrote:
> Using the nested aio_poll() in coroutine is a bad idea. This patch
> replaces the aio_poll loop in bdrv_drain with a BH, if called in
> coroutine.
> 
> For example, the bdrv_drain() in mirror.c can hang when a guest issued
> request is pending on it in qemu_co_mutex_lock().
> 
> Mirror coroutine in this case has just finished a request, and the block
> job is about to complete. It calls bdrv_drain() which waits for the
> other coroutine to complete. The other coroutine is a scsi-disk request.
> The deadlock happens when the latter is in turn pending on the former to
> yield/terminate, in qemu_co_mutex_lock(). The state flow is as below
> (assuming a qcow2 image):
> 
>   mirror coroutine               scsi-disk coroutine
>   -------------------------------------------------------------
>   do last write
> 
>     qcow2:qemu_co_mutex_lock()
>     ...
>                                  scsi disk read
> 
>                                    tracked request begin
> 
>                                    qcow2:qemu_co_mutex_lock.enter
> 
>     qcow2:qemu_co_mutex_unlock()
> 
>   bdrv_drain
>     while (has tracked request)
>       aio_poll()
> 
> In the scsi-disk coroutine, the qemu_co_mutex_lock() will never return
> because the mirror coroutine is blocked in the aio_poll(blocking=true).
> 
> With this patch, the added qemu_coroutine_yield() allows the scsi-disk
> coroutine to make progress as expected:
> 
>   mirror coroutine               scsi-disk coroutine
>   -------------------------------------------------------------
>   do last write
> 
>     qcow2:qemu_co_mutex_lock()
>     ...
>                                  scsi disk read
> 
>                                    tracked request begin
> 
>                                    qcow2:qemu_co_mutex_lock.enter
> 
>     qcow2:qemu_co_mutex_unlock()
> 
>   bdrv_drain.enter
>>   schedule BH
>>   qemu_coroutine_yield()
>>                                  qcow2:qemu_co_mutex_lock.return
>>                                  ...
>                                    tracked request end
>     ...
>     (resumed from BH callback)
>   bdrv_drain.return
>   ...
> 
> Reported-by: Laurent Vivier <lvivier@redhat.com>
> Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
> Signed-off-by: Fam Zheng <famz@redhat.com>

Tested-by: Laurent Vivier <lvivier@redhat.com>
Stefan Hajnoczi April 4, 2016, 11:57 a.m. UTC | #2
On Fri, Apr 01, 2016 at 09:57:38PM +0800, Fam Zheng wrote:
> Using the nested aio_poll() in coroutine is a bad idea. This patch
> replaces the aio_poll loop in bdrv_drain with a BH, if called in
> coroutine.
> 
> For example, the bdrv_drain() in mirror.c can hang when a guest issued
> request is pending on it in qemu_co_mutex_lock().
> 
> Mirror coroutine in this case has just finished a request, and the block
> job is about to complete. It calls bdrv_drain() which waits for the
> other coroutine to complete. The other coroutine is a scsi-disk request.
> The deadlock happens when the latter is in turn pending on the former to
> yield/terminate, in qemu_co_mutex_lock(). The state flow is as below
> (assuming a qcow2 image):
> 
>   mirror coroutine               scsi-disk coroutine
>   -------------------------------------------------------------
>   do last write
> 
>     qcow2:qemu_co_mutex_lock()
>     ...
>                                  scsi disk read
> 
>                                    tracked request begin
> 
>                                    qcow2:qemu_co_mutex_lock.enter
> 
>     qcow2:qemu_co_mutex_unlock()
> 
>   bdrv_drain
>     while (has tracked request)
>       aio_poll()
> 
> In the scsi-disk coroutine, the qemu_co_mutex_lock() will never return
> because the mirror coroutine is blocked in the aio_poll(blocking=true).
> 
> With this patch, the added qemu_coroutine_yield() allows the scsi-disk
> coroutine to make progress as expected:
> 
>   mirror coroutine               scsi-disk coroutine
>   -------------------------------------------------------------
>   do last write
> 
>     qcow2:qemu_co_mutex_lock()
>     ...
>                                  scsi disk read
> 
>                                    tracked request begin
> 
>                                    qcow2:qemu_co_mutex_lock.enter
> 
>     qcow2:qemu_co_mutex_unlock()
> 
>   bdrv_drain.enter
> >   schedule BH
> >   qemu_coroutine_yield()
> >                                  qcow2:qemu_co_mutex_lock.return
> >                                  ...
>                                    tracked request end
>     ...
>     (resumed from BH callback)
>   bdrv_drain.return
>   ...
> 
> Reported-by: Laurent Vivier <lvivier@redhat.com>
> Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
> Signed-off-by: Fam Zheng <famz@redhat.com>
> 
> ---
> 
> v2: Call qemu_bh_delete() in BH callback. [Paolo]
>     Change the loop to an assertion. [Paolo]
>     Elaborate a bit about the fix in commit log. [Paolo]
> ---
>  block/io.c | 41 +++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 41 insertions(+)
> 
> diff --git a/block/io.c b/block/io.c
> index c4869b9..ddcfb4c 100644
> --- a/block/io.c
> +++ b/block/io.c
> @@ -253,6 +253,43 @@ static void bdrv_drain_recurse(BlockDriverState *bs)
>      }
>  }
>  
> +typedef struct {
> +    Coroutine *co;
> +    BlockDriverState *bs;
> +    QEMUBH *bh;
> +    bool done;
> +} BdrvCoDrainData;
> +
> +static void bdrv_co_drain_bh_cb(void *opaque)
> +{
> +    BdrvCoDrainData *data = opaque;
> +    Coroutine *co = data->co;
> +
> +    qemu_bh_delete(data->bh);
> +    bdrv_drain(data->bs);
> +    data->done = true;
> +    qemu_coroutine_enter(co, NULL);
> +}
> +
> +static void coroutine_fn bdrv_co_drain(BlockDriverState *bs)
> +{

Please document why a BH is needed:

/* Calling bdrv_drain() from a BH ensures the
 * current coroutine yields and other coroutines run if they were
 * queued from qemu_co_queue_run_restart().
 */

> +    BdrvCoDrainData data;
> +
> +    assert(qemu_in_coroutine());
> +    data = (BdrvCoDrainData) {
> +        .co = qemu_coroutine_self(),
> +        .bs = bs,
> +        .done = false,
> +        .bh = aio_bh_new(bdrv_get_aio_context(bs), bdrv_co_drain_bh_cb, &data),
> +    };
> +    qemu_bh_schedule(data.bh);
> +
> +    qemu_coroutine_yield();
> +    /* If we are resumed from some other event (such as an aio completion or a
> +     * timer callback), it is a bug in the caller that should be fixed. */
> +    assert(data.done);
> +}
> +
>  /*
>   * Wait for pending requests to complete on a single BlockDriverState subtree,
>   * and suspend block driver's internal I/O until next request arrives.
> @@ -269,6 +306,10 @@ void bdrv_drain(BlockDriverState *bs)
>      bool busy = true;
>  
>      bdrv_drain_recurse(bs);
> +    if (qemu_in_coroutine()) {
> +        bdrv_co_drain(bs);
> +        return;
> +    }
>      while (busy) {
>          /* Keep iterating */
>           bdrv_flush_io_queue(bs);
> -- 
> 2.7.4

block/mirror.c should call bdrv_co_drain() explicitly and bdrv_drain()
should assert(!qemu_in_coroutine()).

The reason existing bdrv_read() and friends detect coroutine context
at runtime is that they are meant for legacy code that runs in both
coroutine and non-coroutine contexts.

Modern coroutine code should call the coroutine interfaces explicitly
instead.

Stefan
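
A minimal sketch of what this suggestion could look like (hypothetical
shapes for illustration only; the final declarations and the exact
mirror.c call site are not part of this thread):

  /* Sketch of the suggested split, not the actual patch:
   * - the coroutine variant becomes a public coroutine_fn,
   * - the synchronous entry point rejects coroutine context,
   * - coroutine callers such as block/mirror.c use the explicit API. */

  /* include/block/block.h (sketch) */
  void coroutine_fn bdrv_co_drain(BlockDriverState *bs);

  /* block/io.c (sketch) */
  void bdrv_drain(BlockDriverState *bs)
  {
      assert(!qemu_in_coroutine()); /* coroutine callers use bdrv_co_drain() */
      bdrv_drain_recurse(bs);
      /* ... existing aio_poll() loop, unchanged ... */
  }

  /* block/mirror.c (sketch), inside a coroutine_fn such as mirror_run() */
  bdrv_co_drain(bs);                /* was: bdrv_drain(bs) */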
Paolo Bonzini April 4, 2016, 2:47 p.m. UTC | #3
On 04/04/2016 13:57, Stefan Hajnoczi wrote:
> On Fri, Apr 01, 2016 at 09:57:38PM +0800, Fam Zheng wrote:
>> Using the nested aio_poll() in coroutine is a bad idea. This patch
>> replaces the aio_poll loop in bdrv_drain with a BH, if called in
>> coroutine.
>>
>> For example, the bdrv_drain() in mirror.c can hang when a guest issued
>> request is pending on it in qemu_co_mutex_lock().
>>
>> Mirror coroutine in this case has just finished a request, and the block
>> job is about to complete. It calls bdrv_drain() which waits for the
>> other coroutine to complete. The other coroutine is a scsi-disk request.
>> The deadlock happens when the latter is in turn pending on the former to
>> yield/terminate, in qemu_co_mutex_lock(). The state flow is as below
>> (assuming a qcow2 image):
>>
>>   mirror coroutine               scsi-disk coroutine
>>   -------------------------------------------------------------
>>   do last write
>>
>>     qcow2:qemu_co_mutex_lock()
>>     ...
>>                                  scsi disk read
>>
>>                                    tracked request begin
>>
>>                                    qcow2:qemu_co_mutex_lock.enter
>>
>>     qcow2:qemu_co_mutex_unlock()
>>
>>   bdrv_drain
>>     while (has tracked request)
>>       aio_poll()
>>
>> In the scsi-disk coroutine, the qemu_co_mutex_lock() will never return
>> because the mirror coroutine is blocked in the aio_poll(blocking=true).
>>
>> With this patch, the added qemu_coroutine_yield() allows the scsi-disk
>> coroutine to make progress as expected:
>>
>>   mirror coroutine               scsi-disk coroutine
>>   -------------------------------------------------------------
>>   do last write
>>
>>     qcow2:qemu_co_mutex_lock()
>>     ...
>>                                  scsi disk read
>>
>>                                    tracked request begin
>>
>>                                    qcow2:qemu_co_mutex_lock.enter
>>
>>     qcow2:qemu_co_mutex_unlock()
>>
>>   bdrv_drain.enter
>>>   schedule BH
>>>   qemu_coroutine_yield()
>>>                                  qcow2:qemu_co_mutex_lock.return
>>>                                  ...
>>                                    tracked request end
>>     ...
>>     (resumed from BH callback)
>>   bdrv_drain.return
>>   ...
>>
>> Reported-by: Laurent Vivier <lvivier@redhat.com>
>> Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
>> Signed-off-by: Fam Zheng <famz@redhat.com>
>>
>> ---
>>
>> v2: Call qemu_bh_delete() in BH callback. [Paolo]
>>     Change the loop to an assertion. [Paolo]
>>     Elaborate a bit about the fix in commit log. [Paolo]
>> ---
>>  block/io.c | 41 +++++++++++++++++++++++++++++++++++++++++
>>  1 file changed, 41 insertions(+)
>>
>> diff --git a/block/io.c b/block/io.c
>> index c4869b9..ddcfb4c 100644
>> --- a/block/io.c
>> +++ b/block/io.c
>> @@ -253,6 +253,43 @@ static void bdrv_drain_recurse(BlockDriverState *bs)
>>      }
>>  }
>>  
>> +typedef struct {
>> +    Coroutine *co;
>> +    BlockDriverState *bs;
>> +    QEMUBH *bh;
>> +    bool done;
>> +} BdrvCoDrainData;
>> +
>> +static void bdrv_co_drain_bh_cb(void *opaque)
>> +{
>> +    BdrvCoDrainData *data = opaque;
>> +    Coroutine *co = data->co;
>> +
>> +    qemu_bh_delete(data->bh);
>> +    bdrv_drain(data->bs);
>> +    data->done = true;
>> +    qemu_coroutine_enter(co, NULL);
>> +}
>> +
>> +static void coroutine_fn bdrv_co_drain(BlockDriverState *bs)
>> +{
> 
> Please document why a BH is needed:
> 
> /* Calling bdrv_drain() from a BH ensures the
>  * current coroutine yields and other coroutines run if they were
>  * queued from qemu_co_queue_run_restart().
>  */
> 
>> +    BdrvCoDrainData data;
>> +
>> +    assert(qemu_in_coroutine());
>> +    data = (BdrvCoDrainData) {
>> +        .co = qemu_coroutine_self(),
>> +        .bs = bs,
>> +        .done = false,
>> +        .bh = aio_bh_new(bdrv_get_aio_context(bs), bdrv_co_drain_bh_cb, &data),
>> +    };
>> +    qemu_bh_schedule(data.bh);
>> +
>> +    qemu_coroutine_yield();
>> +    /* If we are resumed from some other event (such as an aio completion or a
>> +     * timer callback), it is a bug in the caller that should be fixed. */
>> +    assert(data.done);
>> +}
>> +
>>  /*
>>   * Wait for pending requests to complete on a single BlockDriverState subtree,
>>   * and suspend block driver's internal I/O until next request arrives.
>> @@ -269,6 +306,10 @@ void bdrv_drain(BlockDriverState *bs)
>>      bool busy = true;
>>  
>>      bdrv_drain_recurse(bs);
>> +    if (qemu_in_coroutine()) {
>> +        bdrv_co_drain(bs);
>> +        return;
>> +    }
>>      while (busy) {
>>          /* Keep iterating */
>>           bdrv_flush_io_queue(bs);
>> -- 
>> 2.7.4
> 
> block/mirror.c should call bdrv_co_drain() explicitly and bdrv_drain()
> should assert(!qemu_in_coroutine()).
> 
> The reason why existing bdrv_read() and friends detect coroutine context
> at runtime is because it is meant for legacy code that runs in both
> coroutine and non-coroutine contexts.
> 
> Modern coroutine code should call the coroutine interfaces explicitly
> instead.

For what it's worth, I suspect Fam's patch removes the need for
http://article.gmane.org/gmane.comp.emulators.qemu/401375.  That's a
nice bonus. :)

Paolo
Fam Zheng April 5, 2016, 1:27 a.m. UTC | #4
On Mon, 04/04 12:57, Stefan Hajnoczi wrote:
> On Fri, Apr 01, 2016 at 09:57:38PM +0800, Fam Zheng wrote:
> > Using the nested aio_poll() in coroutine is a bad idea. This patch
> > replaces the aio_poll loop in bdrv_drain with a BH, if called in
> > coroutine.
> > 
> > For example, the bdrv_drain() in mirror.c can hang when a guest issued
> > request is pending on it in qemu_co_mutex_lock().
> > 
> > Mirror coroutine in this case has just finished a request, and the block
> > job is about to complete. It calls bdrv_drain() which waits for the
> > other coroutine to complete. The other coroutine is a scsi-disk request.
> > The deadlock happens when the latter is in turn pending on the former to
> > yield/terminate, in qemu_co_mutex_lock(). The state flow is as below
> > (assuming a qcow2 image):
> > 
> >   mirror coroutine               scsi-disk coroutine
> >   -------------------------------------------------------------
> >   do last write
> > 
> >     qcow2:qemu_co_mutex_lock()
> >     ...
> >                                  scsi disk read
> > 
> >                                    tracked request begin
> > 
> >                                    qcow2:qemu_co_mutex_lock.enter
> > 
> >     qcow2:qemu_co_mutex_unlock()
> > 
> >   bdrv_drain
> >     while (has tracked request)
> >       aio_poll()
> > 
> > In the scsi-disk coroutine, the qemu_co_mutex_lock() will never return
> > because the mirror coroutine is blocked in the aio_poll(blocking=true).
> > 
> > With this patch, the added qemu_coroutine_yield() allows the scsi-disk
> > coroutine to make progress as expected:
> > 
> >   mirror coroutine               scsi-disk coroutine
> >   -------------------------------------------------------------
> >   do last write
> > 
> >     qcow2:qemu_co_mutex_lock()
> >     ...
> >                                  scsi disk read
> > 
> >                                    tracked request begin
> > 
> >                                    qcow2:qemu_co_mutex_lock.enter
> > 
> >     qcow2:qemu_co_mutex_unlock()
> > 
> >   bdrv_drain.enter
> > >   schedule BH
> > >   qemu_coroutine_yield()
> > >                                  qcow2:qemu_co_mutex_lock.return
> > >                                  ...
> >                                    tracked request end
> >     ...
> >     (resumed from BH callback)
> >   bdrv_drain.return
> >   ...
> > 
> > Reported-by: Laurent Vivier <lvivier@redhat.com>
> > Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
> > Signed-off-by: Fam Zheng <famz@redhat.com>
> > 
> > ---
> > 
> > v2: Call qemu_bh_delete() in BH callback. [Paolo]
> >     Change the loop to an assertion. [Paolo]
> >     Elaborate a bit about the fix in commit log. [Paolo]
> > ---
> >  block/io.c | 41 +++++++++++++++++++++++++++++++++++++++++
> >  1 file changed, 41 insertions(+)
> > 
> > diff --git a/block/io.c b/block/io.c
> > index c4869b9..ddcfb4c 100644
> > --- a/block/io.c
> > +++ b/block/io.c
> > @@ -253,6 +253,43 @@ static void bdrv_drain_recurse(BlockDriverState *bs)
> >      }
> >  }
> >  
> > +typedef struct {
> > +    Coroutine *co;
> > +    BlockDriverState *bs;
> > +    QEMUBH *bh;
> > +    bool done;
> > +} BdrvCoDrainData;
> > +
> > +static void bdrv_co_drain_bh_cb(void *opaque)
> > +{
> > +    BdrvCoDrainData *data = opaque;
> > +    Coroutine *co = data->co;
> > +
> > +    qemu_bh_delete(data->bh);
> > +    bdrv_drain(data->bs);
> > +    data->done = true;
> > +    qemu_coroutine_enter(co, NULL);
> > +}
> > +
> > +static void coroutine_fn bdrv_co_drain(BlockDriverState *bs)
> > +{
> 
> Please document why a BH is needed:
> 
> /* Calling bdrv_drain() from a BH ensures the
>  * current coroutine yields and other coroutines run if they were
>  * queued from qemu_co_queue_run_restart().
>  */

OK.

> 
> > +    BdrvCoDrainData data;
> > +
> > +    assert(qemu_in_coroutine());
> > +    data = (BdrvCoDrainData) {
> > +        .co = qemu_coroutine_self(),
> > +        .bs = bs,
> > +        .done = false,
> > +        .bh = aio_bh_new(bdrv_get_aio_context(bs), bdrv_co_drain_bh_cb, &data),
> > +    };
> > +    qemu_bh_schedule(data.bh);
> > +
> > +    qemu_coroutine_yield();
> > +    /* If we are resumed from some other event (such as an aio completion or a
> > +     * timer callback), it is a bug in the caller that should be fixed. */
> > +    assert(data.done);
> > +}
> > +
> >  /*
> >   * Wait for pending requests to complete on a single BlockDriverState subtree,
> >   * and suspend block driver's internal I/O until next request arrives.
> > @@ -269,6 +306,10 @@ void bdrv_drain(BlockDriverState *bs)
> >      bool busy = true;
> >  
> >      bdrv_drain_recurse(bs);
> > +    if (qemu_in_coroutine()) {
> > +        bdrv_co_drain(bs);
> > +        return;
> > +    }
> >      while (busy) {
> >          /* Keep iterating */
> >           bdrv_flush_io_queue(bs);
> > -- 
> > 2.7.4
> 
> block/mirror.c should call bdrv_co_drain() explicitly and bdrv_drain()
> should assert(!qemu_in_coroutine()).
> 
> The reason why existing bdrv_read() and friends detect coroutine context
> at runtime is because it is meant for legacy code that runs in both
> coroutine and non-coroutine contexts.

blk_unref/bdrv_unref are called in coroutine context (.bdrv_create
implementations), and that doesn't just work with the assertion. Should I
clean up this "legacy" code first, i.e. move the bdrv_unref calls to BHs in
the callers and assert(!qemu_in_coroutine()) there too? I didn't do that
because it would complicate the code somewhat.

> 
> Modern coroutine code should call the coroutine interfaces explicitly
> instead.
Stefan Hajnoczi April 5, 2016, 9:39 a.m. UTC | #5
On Tue, Apr 05, 2016 at 09:27:39AM +0800, Fam Zheng wrote:
> On Mon, 04/04 12:57, Stefan Hajnoczi wrote:
> > On Fri, Apr 01, 2016 at 09:57:38PM +0800, Fam Zheng wrote:
> > > +    BdrvCoDrainData data;
> > > +
> > > +    assert(qemu_in_coroutine());
> > > +    data = (BdrvCoDrainData) {
> > > +        .co = qemu_coroutine_self(),
> > > +        .bs = bs,
> > > +        .done = false,
> > > +        .bh = aio_bh_new(bdrv_get_aio_context(bs), bdrv_co_drain_bh_cb, &data),
> > > +    };
> > > +    qemu_bh_schedule(data.bh);
> > > +
> > > +    qemu_coroutine_yield();
> > > +    /* If we are resumed from some other event (such as an aio completion or a
> > > +     * timer callback), it is a bug in the caller that should be fixed. */
> > > +    assert(data.done);
> > > +}
> > > +
> > >  /*
> > >   * Wait for pending requests to complete on a single BlockDriverState subtree,
> > >   * and suspend block driver's internal I/O until next request arrives.
> > > @@ -269,6 +306,10 @@ void bdrv_drain(BlockDriverState *bs)
> > >      bool busy = true;
> > >  
> > >      bdrv_drain_recurse(bs);
> > > +    if (qemu_in_coroutine()) {
> > > +        bdrv_co_drain(bs);
> > > +        return;
> > > +    }
> > >      while (busy) {
> > >          /* Keep iterating */
> > >           bdrv_flush_io_queue(bs);
> > > -- 
> > > 2.7.4
> > 
> > block/mirror.c should call bdrv_co_drain() explicitly and bdrv_drain()
> > should assert(!qemu_in_coroutine()).
> > 
> > The reason why existing bdrv_read() and friends detect coroutine context
> > at runtime is because it is meant for legacy code that runs in both
> > coroutine and non-coroutine contexts.
> 
> blk_unref/bdrv_unref are called in coroutine (.bdrv_create implementations),
> and this doesn't just work with the assertion. Should I clean up this "legacy"
> code first, i.e. move bdrv_unref calls to BHs in the callers and
> assert(!qemu_in_coroutine()) there too? I didn't think this because it
> complicates the code somehow.

This is a messy problem.

In general I don't like introducing yields into non-coroutine_fn
functions because it can lead to bugs when the caller didn't expect a
yield point.

For example, I myself wouldn't have expected bdrv_unref() to be a yield
point.  So maybe coroutine code I've written would be vulnerable to
interference (I won't call it a race condition) from another coroutine
across the bdrv_unref() call.  This could mean that another coroutine
now sees intermediate state that would never be visible without the new
yield point.
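
To make the concern concrete, here is a hypothetical coroutine_fn caller
(MyState, INTERMEDIATE and DONE are made up for illustration, not QEMU
code):

  /* Before the change nothing between the two assignments could yield, so
   * no other coroutine could observe s->state == INTERMEDIATE.  With a
   * yield point inside bdrv_unref(), another coroutine may now run at that
   * point and see the intermediate state. */
  static void coroutine_fn cleanup_example(MyState *s)
  {
      s->state = INTERMEDIATE;
      bdrv_unref(s->bs);        /* would now be a potential yield point */
      s->state = DONE;
  }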

I think attempting to invoke qemu_co_queue_run_restart() directly
instead of scheduling a BH and yielding does not improve the situation.
It's also a layering violation since qemu_co_queue_run_restart() is just
meant for the core coroutine code and isn't a public interface.

Anyway, let's consider bdrv_drain() legacy code that may keep the
if (qemu_in_coroutine()) check, but please make bdrv_co_drain() public so
block/mirror.c can at least call it directly.

Stefan
Fam Zheng April 5, 2016, 11:15 a.m. UTC | #6
On Tue, 04/05 10:39, Stefan Hajnoczi wrote:
> > > block/mirror.c should call bdrv_co_drain() explicitly and bdrv_drain()
> > > should assert(!qemu_in_coroutine()).
> > > 
> > > The reason why existing bdrv_read() and friends detect coroutine context
> > > at runtime is because it is meant for legacy code that runs in both
> > > coroutine and non-coroutine contexts.
> > 
> > blk_unref/bdrv_unref are called in coroutine (.bdrv_create implementations),
> > and this doesn't just work with the assertion. Should I clean up this "legacy"
> > code first, i.e. move bdrv_unref calls to BHs in the callers and
> > assert(!qemu_in_coroutine()) there too? I didn't think this because it
> > complicates the code somehow.
> 
> This is a messy problem.
> 
> In general I don't like introducing yields into non-coroutine_fn
> functions because it can lead to bugs when the caller didn't expect a
> yield point.
> 
> For example, I myself wouldn't have expected bdrv_unref() to be a yield
> point.  So maybe coroutine code I've written would be vulnerable to
> interference (I won't call it a race condition) from another coroutine
> across the bdrv_unref() call.  This could mean that another coroutine
> now sees intermediate state that would never be visible without the new
> yield point.
> 
> I think attempting to invoke qemu_co_queue_run_restart() directly
> instead of scheduling a BH and yielding does not improve the situation.
> It's also a layering violation since qemu_co_queue_run_restart() is just
> meant for the core coroutine code and isn't a public interface.
> 
> Anyway, let's consider bdrv_drain() legacy code that can call if
> (qemu_in_coroutine()) but please make bdrv_co_drain() public so
> block/mirror.c can at least call it directly.

OK, will do.

Fam
Paolo Bonzini April 5, 2016, 12:39 p.m. UTC | #7
On 05/04/2016 13:15, Fam Zheng wrote:
> > Anyway, let's consider bdrv_drain() legacy code that can call if
> > (qemu_in_coroutine()) but please make bdrv_co_drain() public so
> > block/mirror.c can at least call it directly.
> OK, will do.

Please create a bdrv_co_drained_begin() function too, even if it's not
used yet.

Paolo
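
A possible shape for the requested addition, assuming it parallels the
existing synchronous drain entry points (a sketch only; the name
bdrv_co_drained_begin() comes from Paolo's request, everything else here
is an assumption):

  /* Sketch of prospective coroutine-context drain entry points. */
  void coroutine_fn bdrv_co_drain(BlockDriverState *bs);
  void coroutine_fn bdrv_co_drained_begin(BlockDriverState *bs);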

Patch

diff --git a/block/io.c b/block/io.c
index c4869b9..ddcfb4c 100644
--- a/block/io.c
+++ b/block/io.c
@@ -253,6 +253,43 @@  static void bdrv_drain_recurse(BlockDriverState *bs)
     }
 }
 
+typedef struct {
+    Coroutine *co;
+    BlockDriverState *bs;
+    QEMUBH *bh;
+    bool done;
+} BdrvCoDrainData;
+
+static void bdrv_co_drain_bh_cb(void *opaque)
+{
+    BdrvCoDrainData *data = opaque;
+    Coroutine *co = data->co;
+
+    qemu_bh_delete(data->bh);
+    bdrv_drain(data->bs);
+    data->done = true;
+    qemu_coroutine_enter(co, NULL);
+}
+
+static void coroutine_fn bdrv_co_drain(BlockDriverState *bs)
+{
+    BdrvCoDrainData data;
+
+    assert(qemu_in_coroutine());
+    data = (BdrvCoDrainData) {
+        .co = qemu_coroutine_self(),
+        .bs = bs,
+        .done = false,
+        .bh = aio_bh_new(bdrv_get_aio_context(bs), bdrv_co_drain_bh_cb, &data),
+    };
+    qemu_bh_schedule(data.bh);
+
+    qemu_coroutine_yield();
+    /* If we are resumed from some other event (such as an aio completion or a
+     * timer callback), it is a bug in the caller that should be fixed. */
+    assert(data.done);
+}
+
 /*
  * Wait for pending requests to complete on a single BlockDriverState subtree,
  * and suspend block driver's internal I/O until next request arrives.
@@ -269,6 +306,10 @@  void bdrv_drain(BlockDriverState *bs)
     bool busy = true;
 
     bdrv_drain_recurse(bs);
+    if (qemu_in_coroutine()) {
+        bdrv_co_drain(bs);
+        return;
+    }
     while (busy) {
         /* Keep iterating */
          bdrv_flush_io_queue(bs);