Message ID | 20180323034356.72130-2-haoqf@linux.vnet.ibm.com (mailing list archive)
State      | New, archived
On Fri, Mar 23, 2018 at 3:43 AM, QingFeng Hao <haoqf@linux.vnet.ibm.com> wrote:
> Test case 185 failed since commit 4486e89c219 --- "vl: introduce vm_shutdown()".
> It's because of the newly introduced function vm_shutdown calls bdrv_drain_all,
> which is called later by bdrv_close_all. bdrv_drain_all resumes the jobs
> that doubles the speed and offset is doubled.
> Some jobs' status are changed as well.
>
> The fix is to not resume the jobs that are already yielded and also change
> 185.out accordingly.
>
> Suggested-by: Stefan Hajnoczi <stefanha@gmail.com>
> Signed-off-by: QingFeng Hao <haoqf@linux.vnet.ibm.com>
> ---
>  blockjob.c                 | 10 +++++++++-
>  include/block/blockjob.h   |  5 +++++
>  tests/qemu-iotests/185.out | 11 +++++++++--

If drain no longer forces the block job to iterate, shouldn't the test
output remain the same? (That would mean the test is fixed by the QEMU
patch.)

>  3 files changed, 23 insertions(+), 3 deletions(-)
>
> diff --git a/blockjob.c b/blockjob.c
> index ef3ed69ff1..fa9838ac97 100644
> --- a/blockjob.c
> +++ b/blockjob.c
> @@ -206,11 +206,16 @@ void block_job_txn_add_job(BlockJobTxn *txn, BlockJob *job)
>
>  static void block_job_pause(BlockJob *job)
>  {
> -    job->pause_count++;
> +    if (!job->yielded) {
> +        job->pause_count++;
> +    }

The pause cannot be ignored. This change introduces a bug.

Pause is not a synchronous operation that stops the job immediately.
Pause just remembers that the job needs to be paused. When the job
runs again (e.g. timer callback, fd handler) it eventually reaches
block_job_pause_point() where it really pauses.

The bug in this patch is:

1. The job has a timer pending.
2. block_job_pause() is called during drain.
3. The timer fires during drain but now the job doesn't know it needs
   to pause, so it continues running!
Instead what needs to happen is that block_job_pause() remains
unmodified but block_job_resume() is extended:

  static void block_job_resume(BlockJob *job)
  {
      assert(job->pause_count > 0);
      job->pause_count--;
      if (job->pause_count) {
          return;
      }
  +   if (job_yielded_before_pause_and_is_still_yielded) {
          block_job_enter(job);
  +   }
  }

This handles the case I mentioned above, where the yield ends before
pause ends (therefore resume must enter the job!).

To make this a little clearer, there are two cases to consider:

Case 1:
1. Job yields
2. Pause
3. Job is entered from timer/fd callback
4. Resume (enter job? yes)

Case 2:
1. Job yields
2. Pause
3. Resume (enter job? no)
4. Job is entered from timer/fd callback

Stefan
On 23.03.2018 at 04:43, QingFeng Hao wrote:
> Test case 185 failed since commit 4486e89c219 --- "vl: introduce vm_shutdown()".
> [...]
> Suggested-by: Stefan Hajnoczi <stefanha@gmail.com>
> Signed-off-by: QingFeng Hao <haoqf@linux.vnet.ibm.com>

Stefan already commented on the fix itself, but I want to add two more
points:

Please change your subject line. "iotests: fix test case 185" means that
you are fixing the test case, not the qemu code that makes the test case
fail. The subject line should describe the actual change. In all
likelihood it will start with "blockjob:" rather than "iotests:".

> diff --git a/include/block/blockjob.h b/include/block/blockjob.h
> index fc645dac68..f8f208bbcf 100644
> --- a/include/block/blockjob.h
> +++ b/include/block/blockjob.h
> @@ -99,6 +99,11 @@ typedef struct BlockJob {
>      bool ready;
>
>      /**
> +     * Set to true when the job is yielded.
> +     */
> +    bool yielded;

This is the same as !busy, so we don't need a new field for this.

Kevin
On 2018/3/23 18:04, Stefan Hajnoczi wrote:
> On Fri, Mar 23, 2018 at 3:43 AM, QingFeng Hao <haoqf@linux.vnet.ibm.com> wrote:
>> [...]
>
> [...]
>
> Instead what needs to happen is that block_job_pause() remains
> unmodified but block_job_resume() is extended:
>
>   static void block_job_resume(BlockJob *job)
>   {
>       assert(job->pause_count > 0);
>       job->pause_count--;
>       if (job->pause_count) {
>           return;
>       }
>   +   if (job_yielded_before_pause_and_is_still_yielded) {

Thanks a lot for your great comments! But I can't figure out how to
check this.

>           block_job_enter(job);
>   +   }
>   }
>
> This handles the case I mentioned above, where the yield ends before
> pause ends (therefore resume must enter the job!).
>
> To make this a little clearer, there are two cases to consider:
>
> Case 1:
> 1. Job yields
> 2. Pause
> 3. Job is entered from timer/fd callback

How do I know whether the job was entered from a timer/fd callback?
Thanks.

> 4. Resume (enter job? yes)
>
> Case 2:
> 1. Job yields
> 2. Pause
> 3. Resume (enter job? no)
> 4. Job is entered from timer/fd callback
>
> Stefan
On 2018/3/26 18:29, Kevin Wolf wrote:
> On 23.03.2018 at 04:43, QingFeng Hao wrote:
>> [...]
>
> Stefan already commented on the fix itself, but I want to add two more
> points:
>
> Please change your subject line. "iotests: fix test case 185" means that
> you are fixing the test case, not the qemu code that makes the test case
> fail. The subject line should describe the actual change. In all
> likelihood it will start with "blockjob:" rather than "iotests:".

Sure! Thanks for pointing that out.

>>      /**
>> +     * Set to true when the job is yielded.
>> +     */
>> +    bool yielded;
>
> This is the same as !busy, so we don't need a new field for this.

Mostly yes, but the trick is that busy is set to true in
block_job_do_yield.

> Kevin
On Tue, Mar 27, 2018 at 11:32:00AM +0800, QingFeng Hao wrote:
> On 2018/3/23 18:04, Stefan Hajnoczi wrote:
> > [...]
> >
> > Instead what needs to happen is that block_job_pause() remains
> > unmodified but block_job_resume() is extended:
> >
> > [...]
> > +   if (job_yielded_before_pause_and_is_still_yielded) {
>
> Thanks a lot for your great comments! But I can't figure out how to
> check this.
>
> > [...]
> >
> > Case 1:
> > 1. Job yields
> > 2. Pause
> > 3. Job is entered from timer/fd callback
>
> How do I know whether the job was entered from a timer/fd callback?
> Thanks.

Sorry, in order to answer your question properly I would have to study
the code and get to the point where I could write the patch myself.

I have sent a patch to update the test output for the upcoming QEMU
2.12 release. At this time in the release cycle it's the most
appropriate solution.

Stefan
diff --git a/blockjob.c b/blockjob.c
index ef3ed69ff1..fa9838ac97 100644
--- a/blockjob.c
+++ b/blockjob.c
@@ -206,11 +206,16 @@ void block_job_txn_add_job(BlockJobTxn *txn, BlockJob *job)
 
 static void block_job_pause(BlockJob *job)
 {
-    job->pause_count++;
+    if (!job->yielded) {
+        job->pause_count++;
+    }
 }
 
 static void block_job_resume(BlockJob *job)
 {
+    if (job->yielded) {
+        return;
+    }
     assert(job->pause_count > 0);
     job->pause_count--;
     if (job->pause_count) {
@@ -371,6 +376,7 @@ static void block_job_sleep_timer_cb(void *opaque)
     BlockJob *job = opaque;
 
     block_job_enter(job);
+    job->yielded = false;
 }
 
 void block_job_start(BlockJob *job)
@@ -935,6 +941,7 @@ void *block_job_create(const char *job_id, const BlockJobDriver *driver,
     job->cb            = cb;
     job->opaque        = opaque;
     job->busy          = false;
+    job->yielded       = false;
     job->paused        = true;
     job->pause_count   = 1;
     job->refcnt        = 1;
@@ -1034,6 +1041,7 @@ static void block_job_do_yield(BlockJob *job, uint64_t ns)
         timer_mod(&job->sleep_timer, ns);
     }
     job->busy = false;
+    job->yielded = true;
     block_job_unlock();
 
     qemu_coroutine_yield();
diff --git a/include/block/blockjob.h b/include/block/blockjob.h
index fc645dac68..f8f208bbcf 100644
--- a/include/block/blockjob.h
+++ b/include/block/blockjob.h
@@ -99,6 +99,11 @@ typedef struct BlockJob {
     bool ready;
 
     /**
+     * Set to true when the job is yielded.
+     */
+    bool yielded;
+
+    /**
      * Set to true when the job has deferred work to the main loop.
      */
     bool deferred_to_main_loop;
diff --git a/tests/qemu-iotests/185.out b/tests/qemu-iotests/185.out
index 57eaf8d699..798282e196 100644
--- a/tests/qemu-iotests/185.out
+++ b/tests/qemu-iotests/185.out
@@ -7,6 +7,7 @@ Formatting 'TEST_DIR/t.IMGFMT.base', fmt=IMGFMT size=67108864
 
 === Creating backing chain ===
 
+{"timestamp": {"seconds": TIMESTAMP, "microseconds": TIMESTAMP}, "event": "RESUME"}
 Formatting 'TEST_DIR/t.qcow2.mid', fmt=qcow2 size=67108864 backing_file=TEST_DIR/t.qcow2.base backing_fmt=qcow2 cluster_size=65536 lazy_refcounts=off refcount_bits=16
 {"return": {}}
 wrote 4194304/4194304 bytes at offset 0
@@ -25,23 +26,28 @@ Formatting 'TEST_DIR/t.qcow2', fmt=qcow2 size=67108864 backing_file=TEST_DIR/t.q
 === Start active commit job and exit qemu ===
 
 {"return": {}}
+{"timestamp": {"seconds": TIMESTAMP, "microseconds": TIMESTAMP}, "event": "RESUME"}
 {"return": {}}
 {"return": {}}
 {"timestamp": {"seconds": TIMESTAMP, "microseconds": TIMESTAMP}, "event": "SHUTDOWN", "data": {"guest": false}}
-{"timestamp": {"seconds": TIMESTAMP, "microseconds": TIMESTAMP}, "event": "BLOCK_JOB_CANCELLED", "data": {"device": "disk", "len": 4194304, "offset": 4194304, "speed": 65536, "type": "commit"}}
+{"timestamp": {"seconds": TIMESTAMP, "microseconds": TIMESTAMP}, "event": "BLOCK_JOB_READY", "data": {"device": "disk", "len": 4194304, "offset": 4194304, "speed": 65536, "type": "commit"}}
+{"timestamp": {"seconds": TIMESTAMP, "microseconds": TIMESTAMP}, "event": "BLOCK_JOB_COMPLETED", "data": {"device": "disk", "len": 4194304, "offset": 4194304, "speed": 65536, "type": "commit"}}
 
 === Start mirror job and exit qemu ===
 
 {"return": {}}
+{"timestamp": {"seconds": TIMESTAMP, "microseconds": TIMESTAMP}, "event": "RESUME"}
 Formatting 'TEST_DIR/t.qcow2.copy', fmt=qcow2 size=67108864 cluster_size=65536 lazy_refcounts=off refcount_bits=16
 {"return": {}}
+{"timestamp": {"seconds": TIMESTAMP, "microseconds": TIMESTAMP}, "event": "BLOCK_JOB_READY", "data": {"device": "disk", "len": 4194304, "offset": 4194304, "speed": 65536, "type": "mirror"}}
 {"return": {}}
 {"timestamp": {"seconds": TIMESTAMP, "microseconds": TIMESTAMP}, "event": "SHUTDOWN", "data": {"guest": false}}
-{"timestamp": {"seconds": TIMESTAMP, "microseconds": TIMESTAMP}, "event": "BLOCK_JOB_CANCELLED", "data": {"device": "disk", "len": 4194304, "offset": 4194304, "speed": 65536, "type": "mirror"}}
+{"timestamp": {"seconds": TIMESTAMP, "microseconds": TIMESTAMP}, "event": "BLOCK_JOB_COMPLETED", "data": {"device": "disk", "len": 4194304, "offset": 4194304, "speed": 65536, "type": "mirror"}}
 
 === Start backup job and exit qemu ===
 
 {"return": {}}
+{"timestamp": {"seconds": TIMESTAMP, "microseconds": TIMESTAMP}, "event": "RESUME"}
 Formatting 'TEST_DIR/t.qcow2.copy', fmt=qcow2 size=67108864 cluster_size=65536 lazy_refcounts=off refcount_bits=16
 {"return": {}}
 {"return": {}}
@@ -51,6 +57,7 @@ Formatting 'TEST_DIR/t.qcow2.copy', fmt=qcow2 size=67108864 cluster_size=65536 l
 === Start streaming job and exit qemu ===
 
 {"return": {}}
+{"timestamp": {"seconds": TIMESTAMP, "microseconds": TIMESTAMP}, "event": "RESUME"}
 {"return": {}}
 {"return": {}}
 {"timestamp": {"seconds": TIMESTAMP, "microseconds": TIMESTAMP}, "event": "SHUTDOWN", "data": {"guest": false}}
Test case 185 has failed since commit 4486e89c219 ("vl: introduce
vm_shutdown()"). This is because the newly introduced function
vm_shutdown calls bdrv_drain_all, which is also called later by
bdrv_close_all. bdrv_drain_all resumes the jobs, which doubles the
speed, and the offset is doubled as well. Some jobs' status changes
too.

The fix is to not resume jobs that are already yielded, and to change
185.out accordingly.

Suggested-by: Stefan Hajnoczi <stefanha@gmail.com>
Signed-off-by: QingFeng Hao <haoqf@linux.vnet.ibm.com>
---
 blockjob.c                 | 10 +++++++++-
 include/block/blockjob.h   |  5 +++++
 tests/qemu-iotests/185.out | 11 +++++++++--
 3 files changed, 23 insertions(+), 3 deletions(-)