diff mbox

[v2,1/1] iotests: fix test case 185

Message ID 20180323034356.72130-2-haoqf@linux.vnet.ibm.com (mailing list archive)
State New, archived

Commit Message

Hao QingFeng March 23, 2018, 3:43 a.m. UTC
Test case 185 has failed since commit 4486e89c219 ("vl: introduce vm_shutdown()").
This is because the newly introduced function vm_shutdown calls bdrv_drain_all,
which is called again later by bdrv_close_all. bdrv_drain_all resumes the jobs,
which doubles the speed, so the offset is doubled as well.
Some jobs' statuses change too.

The fix is to not resume jobs that have already yielded, and to change
185.out accordingly.

Suggested-by: Stefan Hajnoczi <stefanha@gmail.com>
Signed-off-by: QingFeng Hao <haoqf@linux.vnet.ibm.com>
---
 blockjob.c                 | 10 +++++++++-
 include/block/blockjob.h   |  5 +++++
 tests/qemu-iotests/185.out | 11 +++++++++--
 3 files changed, 23 insertions(+), 3 deletions(-)

Comments

Stefan Hajnoczi March 23, 2018, 10:04 a.m. UTC | #1
On Fri, Mar 23, 2018 at 3:43 AM, QingFeng Hao <haoqf@linux.vnet.ibm.com> wrote:
> Test case 185 failed since commit 4486e89c219 --- "vl: introduce vm_shutdown()".
> It's because of the newly introduced function vm_shutdown calls bdrv_drain_all,
> which is called later by bdrv_close_all. bdrv_drain_all resumes the jobs
> that doubles the speed and offset is doubled.
> Some jobs' status are changed as well.
>
> The fix is to not resume the jobs that are already yielded and also change
> 185.out accordingly.
>
> Suggested-by: Stefan Hajnoczi <stefanha@gmail.com>
> Signed-off-by: QingFeng Hao <haoqf@linux.vnet.ibm.com>
> ---
>  blockjob.c                 | 10 +++++++++-
>  include/block/blockjob.h   |  5 +++++
>  tests/qemu-iotests/185.out | 11 +++++++++--

If drain no longer forces the block job to iterate, shouldn't the test
output remain the same?  (That means the test is fixed by the QEMU
patch.)

>  3 files changed, 23 insertions(+), 3 deletions(-)
>
> diff --git a/blockjob.c b/blockjob.c
> index ef3ed69ff1..fa9838ac97 100644
> --- a/blockjob.c
> +++ b/blockjob.c
> @@ -206,11 +206,16 @@ void block_job_txn_add_job(BlockJobTxn *txn, BlockJob *job)
>
>  static void block_job_pause(BlockJob *job)
>  {
> -    job->pause_count++;
> +    if (!job->yielded) {
> +        job->pause_count++;
> +    }

The pause cannot be ignored.  This change introduces a bug.

Pause is not a synchronous operation that stops the job immediately.
Pause just remembers that the job needs to be paused.  When the job
runs again (e.g. timer callback, fd handler) it eventually reaches
block_job_pause_point(), where it really pauses.

The bug in this patch is:

1. The job has a timer pending.
2. block_job_pause() is called during drain.
3. The timer fires during drain but now the job doesn't know it needs
to pause, so it continues running!

Instead what needs to happen is that block_job_pause() remains
unmodified but block_job_resume() is extended:

static void block_job_resume(BlockJob *job)
{
    assert(job->pause_count > 0);
    job->pause_count--;
    if (job->pause_count) {
        return;
    }
+    if (job_yielded_before_pause_and_is_still_yielded) {
    block_job_enter(job);
+    }
}

This handles the case I mentioned above, where the yield ends before
pause ends (therefore resume must enter the job!).

To make this a little clearer, there are two cases to consider:

Case 1:
1. Job yields
2. Pause
3. Job is entered from timer/fd callback
4. Resume (enter job? yes)

Case 2:
1. Job yields
2. Pause
3. Resume (enter job? no)
4. Job is entered from timer/fd callback

Stefan
Kevin Wolf March 26, 2018, 10:29 a.m. UTC | #2
Am 23.03.2018 um 04:43 hat QingFeng Hao geschrieben:
> Test case 185 failed since commit 4486e89c219 --- "vl: introduce vm_shutdown()".
> It's because of the newly introduced function vm_shutdown calls bdrv_drain_all,
> which is called later by bdrv_close_all. bdrv_drain_all resumes the jobs
> that doubles the speed and offset is doubled.
> Some jobs' status are changed as well.
> 
> The fix is to not resume the jobs that are already yielded and also change
> 185.out accordingly.
> 
> Suggested-by: Stefan Hajnoczi <stefanha@gmail.com>
> Signed-off-by: QingFeng Hao <haoqf@linux.vnet.ibm.com>

Stefan already commented on the fix itself, but I want to add two more
points:

Please change your subject line. "iotests: fix test case 185" means that
you are fixing the test case, not qemu code that makes the test case
fail. The subject line should describe the actual change. In all
likelihood it will start with "blockjob:" rather than "iotests:".

> diff --git a/include/block/blockjob.h b/include/block/blockjob.h
> index fc645dac68..f8f208bbcf 100644
> --- a/include/block/blockjob.h
> +++ b/include/block/blockjob.h
> @@ -99,6 +99,11 @@ typedef struct BlockJob {
>      bool ready;
>  
>      /**
> +     * Set to true when the job is yielded.
> +     */
> +    bool yielded;

This is the same as !busy, so we don't need a new field for this.

Kevin
Hao QingFeng March 27, 2018, 3:32 a.m. UTC | #3
在 2018/3/23 18:04, Stefan Hajnoczi 写道:
> On Fri, Mar 23, 2018 at 3:43 AM, QingFeng Hao <haoqf@linux.vnet.ibm.com> wrote:
>> Test case 185 failed since commit 4486e89c219 --- "vl: introduce vm_shutdown()".
>> It's because of the newly introduced function vm_shutdown calls bdrv_drain_all,
>> which is called later by bdrv_close_all. bdrv_drain_all resumes the jobs
>> that doubles the speed and offset is doubled.
>> Some jobs' status are changed as well.
>>
>> The fix is to not resume the jobs that are already yielded and also change
>> 185.out accordingly.
>>
>> Suggested-by: Stefan Hajnoczi <stefanha@gmail.com>
>> Signed-off-by: QingFeng Hao <haoqf@linux.vnet.ibm.com>
>> ---
>>   blockjob.c                 | 10 +++++++++-
>>   include/block/blockjob.h   |  5 +++++
>>   tests/qemu-iotests/185.out | 11 +++++++++--
> 
> If drain no longer forces the block job to iterate, shouldn't the test
> output remain the same?  (The means the test is fixed by the QEMU
> patch.)
> 
>>   3 files changed, 23 insertions(+), 3 deletions(-)
>>
>> diff --git a/blockjob.c b/blockjob.c
>> index ef3ed69ff1..fa9838ac97 100644
>> --- a/blockjob.c
>> +++ b/blockjob.c
>> @@ -206,11 +206,16 @@ void block_job_txn_add_job(BlockJobTxn *txn, BlockJob *job)
>>
>>   static void block_job_pause(BlockJob *job)
>>   {
>> -    job->pause_count++;
>> +    if (!job->yielded) {
>> +        job->pause_count++;
>> +    }
> 
> The pause cannot be ignored.  This change introduces a bug.
> 
> Pause is not a synchronous operation that stops the job immediately.
> Pause just remembers that the job needs to be paused.   When the job
> runs again (e.g. timer callback, fd handler) it eventually reaches
> block_job_pause_point() where it really pauses.
> 
> The bug in this patch is:
> 
> 1. The job has a timer pending.
> 2. block_job_pause() is called during drain.
> 3. The timer fires during drain but now the job doesn't know it needs
> to pause, so it continues running!
> 
> Instead what needs to happen is that block_job_pause() remains
> unmodified but block_job_resume if extended:
> 
> static void block_job_resume(BlockJob *job)
> {
>      assert(job->pause_count > 0);
>      job->pause_count--;
>      if (job->pause_count) {
>          return;
>      }
> +    if (job_yielded_before_pause_and_is_still_yielded) {
Thanks a lot for your great comments! But I can't figure out how to 
check this.
>      block_job_enter(job);
> +    }
> }
> 
> This handles the case I mentioned above, where the yield ends before
> pause ends (therefore resume must enter the job!).
> 
> To make this a little clearer, there are two cases to consider:
> 
> Case 1:
> 1. Job yields
> 2. Pause
> 3. Job is entered from timer/fd callback
How do I know whether the job is entered from ...? Thanks
> 4. Resume (enter job? yes)
> 
> Case 2:
> 1. Job yields
> 2. Pause
> 3. Resume (enter job? no)
> 4. Job is entered from timer/fd callback
> 
> Stefan
>
Hao QingFeng March 27, 2018, 3:33 a.m. UTC | #4
在 2018/3/26 18:29, Kevin Wolf 写道:
> Am 23.03.2018 um 04:43 hat QingFeng Hao geschrieben:
>> Test case 185 failed since commit 4486e89c219 --- "vl: introduce vm_shutdown()".
>> It's because of the newly introduced function vm_shutdown calls bdrv_drain_all,
>> which is called later by bdrv_close_all. bdrv_drain_all resumes the jobs
>> that doubles the speed and offset is doubled.
>> Some jobs' status are changed as well.
>>
>> The fix is to not resume the jobs that are already yielded and also change
>> 185.out accordingly.
>>
>> Suggested-by: Stefan Hajnoczi <stefanha@gmail.com>
>> Signed-off-by: QingFeng Hao <haoqf@linux.vnet.ibm.com>
> 
> Stefan already commented on the fix itself, but I want to add two more
> points:
> 
> Please change your subject line. "iotests: fix test case 185" means that
> you are fixing the test case, not qemu code that makes the test case
> fail. The subject line should describe the actual change. In all
> likelihood it will start with "blockjob:" rather than "iotests:".
Sure! Thanks for pointing that out.
> 
>> diff --git a/include/block/blockjob.h b/include/block/blockjob.h
>> index fc645dac68..f8f208bbcf 100644
>> --- a/include/block/blockjob.h
>> +++ b/include/block/blockjob.h
>> @@ -99,6 +99,11 @@ typedef struct BlockJob {
>>       bool ready;
>>   
>>       /**
>> +     * Set to true when the job is yielded.
>> +     */
>> +    bool yielded;
> 
> This is the same as !busy, so we don't need a new field for this.
> 
Mostly yes, but the trick is that busy is set to true in 
block_job_do_yield.
> Kevin
>
Stefan Hajnoczi April 3, 2018, 2:20 p.m. UTC | #5
On Tue, Mar 27, 2018 at 11:32:00AM +0800, QingFeng Hao wrote:
> 
> 在 2018/3/23 18:04, Stefan Hajnoczi 写道:
> > On Fri, Mar 23, 2018 at 3:43 AM, QingFeng Hao <haoqf@linux.vnet.ibm.com> wrote:
> > > Test case 185 failed since commit 4486e89c219 --- "vl: introduce vm_shutdown()".
> > > It's because of the newly introduced function vm_shutdown calls bdrv_drain_all,
> > > which is called later by bdrv_close_all. bdrv_drain_all resumes the jobs
> > > that doubles the speed and offset is doubled.
> > > Some jobs' status are changed as well.
> > > 
> > > The fix is to not resume the jobs that are already yielded and also change
> > > 185.out accordingly.
> > > 
> > > Suggested-by: Stefan Hajnoczi <stefanha@gmail.com>
> > > Signed-off-by: QingFeng Hao <haoqf@linux.vnet.ibm.com>
> > > ---
> > >   blockjob.c                 | 10 +++++++++-
> > >   include/block/blockjob.h   |  5 +++++
> > >   tests/qemu-iotests/185.out | 11 +++++++++--
> > 
> > If drain no longer forces the block job to iterate, shouldn't the test
> > output remain the same?  (The means the test is fixed by the QEMU
> > patch.)
> > 
> > >   3 files changed, 23 insertions(+), 3 deletions(-)
> > > 
> > > diff --git a/blockjob.c b/blockjob.c
> > > index ef3ed69ff1..fa9838ac97 100644
> > > --- a/blockjob.c
> > > +++ b/blockjob.c
> > > @@ -206,11 +206,16 @@ void block_job_txn_add_job(BlockJobTxn *txn, BlockJob *job)
> > > 
> > >   static void block_job_pause(BlockJob *job)
> > >   {
> > > -    job->pause_count++;
> > > +    if (!job->yielded) {
> > > +        job->pause_count++;
> > > +    }
> > 
> > The pause cannot be ignored.  This change introduces a bug.
> > 
> > Pause is not a synchronous operation that stops the job immediately.
> > Pause just remembers that the job needs to be paused.   When the job
> > runs again (e.g. timer callback, fd handler) it eventually reaches
> > block_job_pause_point() where it really pauses.
> > 
> > The bug in this patch is:
> > 
> > 1. The job has a timer pending.
> > 2. block_job_pause() is called during drain.
> > 3. The timer fires during drain but now the job doesn't know it needs
> > to pause, so it continues running!
> > 
> > Instead what needs to happen is that block_job_pause() remains
> > unmodified but block_job_resume if extended:
> > 
> > static void block_job_resume(BlockJob *job)
> > {
> >      assert(job->pause_count > 0);
> >      job->pause_count--;
> >      if (job->pause_count) {
> >          return;
> >      }
> > +    if (job_yielded_before_pause_and_is_still_yielded) {
> Thanks a lot for your great comments! But I can't figure out how to check
> this.
> >      block_job_enter(job);
> > +    }
> > }
> > 
> > This handles the case I mentioned above, where the yield ends before
> > pause ends (therefore resume must enter the job!).
> > 
> > To make this a little clearer, there are two cases to consider:
> > 
> > Case 1:
> > 1. Job yields
> > 2. Pause
> > 3. Job is entered from timer/fd callback
> How do I know that if job is entered from ...? thanks

Sorry, in order to answer your question properly I would have to study
the code and get to the point where I could write the patch myself.

I have sent a patch to update the test output for the upcoming QEMU 2.12
release.  At this time in the release cycle it's the most appropriate
solution.

Stefan

Patch

diff --git a/blockjob.c b/blockjob.c
index ef3ed69ff1..fa9838ac97 100644
--- a/blockjob.c
+++ b/blockjob.c
@@ -206,11 +206,16 @@  void block_job_txn_add_job(BlockJobTxn *txn, BlockJob *job)
 
 static void block_job_pause(BlockJob *job)
 {
-    job->pause_count++;
+    if (!job->yielded) {
+        job->pause_count++;
+    }
 }
 
 static void block_job_resume(BlockJob *job)
 {
+    if (job->yielded) {
+        return;
+    }
     assert(job->pause_count > 0);
     job->pause_count--;
     if (job->pause_count) {
@@ -371,6 +376,7 @@  static void block_job_sleep_timer_cb(void *opaque)
     BlockJob *job = opaque;
 
     block_job_enter(job);
+    job->yielded = false;
 }
 
 void block_job_start(BlockJob *job)
@@ -935,6 +941,7 @@  void *block_job_create(const char *job_id, const BlockJobDriver *driver,
     job->cb            = cb;
     job->opaque        = opaque;
     job->busy          = false;
+    job->yielded       = false;
     job->paused        = true;
     job->pause_count   = 1;
     job->refcnt        = 1;
@@ -1034,6 +1041,7 @@  static void block_job_do_yield(BlockJob *job, uint64_t ns)
         timer_mod(&job->sleep_timer, ns);
     }
     job->busy = false;
+    job->yielded = true;
     block_job_unlock();
     qemu_coroutine_yield();
 
diff --git a/include/block/blockjob.h b/include/block/blockjob.h
index fc645dac68..f8f208bbcf 100644
--- a/include/block/blockjob.h
+++ b/include/block/blockjob.h
@@ -99,6 +99,11 @@  typedef struct BlockJob {
     bool ready;
 
     /**
+     * Set to true when the job is yielded.
+     */
+    bool yielded;
+
+    /**
      * Set to true when the job has deferred work to the main loop.
      */
     bool deferred_to_main_loop;
diff --git a/tests/qemu-iotests/185.out b/tests/qemu-iotests/185.out
index 57eaf8d699..798282e196 100644
--- a/tests/qemu-iotests/185.out
+++ b/tests/qemu-iotests/185.out
@@ -7,6 +7,7 @@  Formatting 'TEST_DIR/t.IMGFMT.base', fmt=IMGFMT size=67108864
 
 === Creating backing chain ===
 
+{"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "RESUME"}
 Formatting 'TEST_DIR/t.qcow2.mid', fmt=qcow2 size=67108864 backing_file=TEST_DIR/t.qcow2.base backing_fmt=qcow2 cluster_size=65536 lazy_refcounts=off refcount_bits=16
 {"return": {}}
 wrote 4194304/4194304 bytes at offset 0
@@ -25,23 +26,28 @@  Formatting 'TEST_DIR/t.qcow2', fmt=qcow2 size=67108864 backing_file=TEST_DIR/t.q
 === Start active commit job and exit qemu ===
 
 {"return": {}}
+{"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "RESUME"}
 {"return": {}}
 {"return": {}}
 {"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "SHUTDOWN", "data": {"guest": false}}
-{"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "BLOCK_JOB_CANCELLED", "data": {"device": "disk", "len": 4194304, "offset": 4194304, "speed": 65536, "type": "commit"}}
+{"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "BLOCK_JOB_READY", "data": {"device": "disk", "len": 4194304, "offset": 4194304, "speed": 65536, "type": "commit"}}
+{"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "BLOCK_JOB_COMPLETED", "data": {"device": "disk", "len": 4194304, "offset": 4194304, "speed": 65536, "type": "commit"}}
 
 === Start mirror job and exit qemu ===
 
 {"return": {}}
+{"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "RESUME"}
 Formatting 'TEST_DIR/t.qcow2.copy', fmt=qcow2 size=67108864 cluster_size=65536 lazy_refcounts=off refcount_bits=16
 {"return": {}}
+{"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "BLOCK_JOB_READY", "data": {"device": "disk", "len": 4194304, "offset": 4194304, "speed": 65536, "type": "mirror"}}
 {"return": {}}
 {"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "SHUTDOWN", "data": {"guest": false}}
-{"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "BLOCK_JOB_CANCELLED", "data": {"device": "disk", "len": 4194304, "offset": 4194304, "speed": 65536, "type": "mirror"}}
+{"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "BLOCK_JOB_COMPLETED", "data": {"device": "disk", "len": 4194304, "offset": 4194304, "speed": 65536, "type": "mirror"}}
 
 === Start backup job and exit qemu ===
 
 {"return": {}}
+{"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "RESUME"}
 Formatting 'TEST_DIR/t.qcow2.copy', fmt=qcow2 size=67108864 cluster_size=65536 lazy_refcounts=off refcount_bits=16
 {"return": {}}
 {"return": {}}
@@ -51,6 +57,7 @@  Formatting 'TEST_DIR/t.qcow2.copy', fmt=qcow2 size=67108864 cluster_size=65536 l
 === Start streaming job and exit qemu ===
 
 {"return": {}}
+{"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "RESUME"}
 {"return": {}}
 {"return": {}}
 {"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "SHUTDOWN", "data": {"guest": false}}