Message ID | 6e00b90e-5c95-8b02-23c2-0acfe9862f6a@redhat.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | iotest 129 | expand |
12.01.2021 20:44, Max Reitz wrote: > Hi, > > tl;dr: I have some troubles debugging what’s wrong with iotest 129. It wants to check that 'stop' does not drain a block job, but to me it seems like that’s exactly what’s happening with the mirror job. > > > For quite some time, I’ve had 129 disabled in my test branch because it fails all the time for me. Now it came up again in Vladimir’s async backup series, and so I tried my hands at debugging it once again. > > Recap: 129 writes 128M to some image, then runs a block job from there (while the source drive is supposedly throttled), then issues a stop command, and checks that the job is not drained. I.e., still running. > > (It checks the “running” state via @busy, which is probably wrong; it should verify that @state == 'running' (which wasn’t available back in 2015), but that’s not really why I’m writing this mail.) > > Like the last time I tried > (https://lists.nongnu.org/archive/html/qemu-block/2019-06/msg00499.html) I can see that block jobs completely ignore BB throttling: If you apply the attachment show-progress.patch, you’ll probably see some progress made while the test waits for five seconds. (Here, on tmpfs, mirror and commit get to READY, and backup completes.) > > OK, so now that block jobs don’t completely break with filters, you can instead use a filter node on the source (as I tried in the patch referenced above). And to fully fix the test, we’d also replace the @busy == True check by @status == 'running'. That’s the attachment filter-node-show-progress.patch. > > If I apply that, I can see that now there actually is some throttling. After the time.sleep(5), mirror and commit get to exactly 1 MB, and backup gets to 1.0625 MB. Good. > > However, after the stop is issued, backup stays at 1.2 MB (good), but mirror and commit progress obviously without throttling to 30, 40, 50 MB, whatever. So it appears to me they are drained by the stop. I.e., precisely what the iotest is trying to prove not to happen. I don't follow.. Increasing of progress means that jobs are drained? > > > I plan to continue investigating, but I just wanted to send this mail to see whether perhaps someone has an idea on what’s going on. > > (My main problem is that bisecting this is hard. AFAIK the throttling applied in master:129 has been broken ever since blockdev throttling was moved to the BB. Even without the “Deal with filters” series, you can use at least mirror sync=full from filter nodes, so I tried to use filter-node-show-progress.patch for just test_drive_mirror(), but that only works back until fe4f0614ef9e361d (or rather 0f12264e7a4145 if you prefer not to get a SIGABRT). Before that commit, it hangs.) > > Max Hmm, in show-progress.patch you call "stop" the second time.. It's a mistake I think.. Also, on current master x-bps-total I can find only in iotests.. Where is it defined o_O? If I change it to be bps-total, it fails.. Strange. I've run the test with your patch with throttling filter (and a bit more logging).. Interesting. It looks like throttling just don't work noramlly after stop.. I see that mirror does one 1M request, and it obviously overflow throttle limit, so during your next 5 seconds it does nothing (and we see progress of 1M). But immediately after "stop" command all 16 read requests pending for throttling goes, and then a lot more read requests (hmm, exactly 31) are issued and not throttled at all (but goes through throttle node). And then new 16 requests are issued and throttled. I've looked at throttle_group_co_io_limits_intercept() and I just don't understand how it works)
On 13.01.21 10:53, Vladimir Sementsov-Ogievskiy wrote: > 12.01.2021 20:44, Max Reitz wrote: >> Hi, >> >> tl;dr: I have some troubles debugging what’s wrong with iotest 129. >> It wants to check that 'stop' does not drain a block job, but to me it >> seems like that’s exactly what’s happening with the mirror job. >> >> >> For quite some time, I’ve had 129 disabled in my test branch because >> it fails all the time for me. Now it came up again in Vladimir’s >> async backup series, and so I tried my hands at debugging it once again. >> >> Recap: 129 writes 128M to some image, then runs a block job from there >> (while the source drive is supposedly throttled), then issues a stop >> command, and checks that the job is not drained. I.e., still running. >> >> (It checks the “running” state via @busy, which is probably wrong; it >> should verify that @state == 'running' (which wasn’t available back in >> 2015), but that’s not really why I’m writing this mail.) >> >> Like the last time I tried >> (https://lists.nongnu.org/archive/html/qemu-block/2019-06/msg00499.html) >> I can see that block jobs completely ignore BB throttling: If you >> apply the attachment show-progress.patch, you’ll probably see some >> progress made while the test waits for five seconds. (Here, on tmpfs, >> mirror and commit get to READY, and backup completes.) >> >> OK, so now that block jobs don’t completely break with filters, you >> can instead use a filter node on the source (as I tried in the patch >> referenced above). And to fully fix the test, we’d also replace the >> @busy == True check by @status == 'running'. That’s the attachment >> filter-node-show-progress.patch. >> >> If I apply that, I can see that now there actually is some throttling. >> After the time.sleep(5), mirror and commit get to exactly 1 MB, and >> backup gets to 1.0625 MB. Good. >> >> However, after the stop is issued, backup stays at 1.2 MB (good), but >> mirror and commit progress obviously without throttling to 30, 40, 50 >> MB, whatever. So it appears to me they are drained by the stop. >> I.e., precisely what the iotest is trying to prove not to happen. > > I don't follow.. Increasing of progress means that jobs are drained? I don’t know. It does mean that throttling is ignored for a bit, though. I could imagine that’s because something is being drained. >> I plan to continue investigating, but I just wanted to send this mail >> to see whether perhaps someone has an idea on what’s going on. >> >> (My main problem is that bisecting this is hard. AFAIK the throttling >> applied in master:129 has been broken ever since blockdev throttling >> was moved to the BB. Even without the “Deal with filters” series, you >> can use at least mirror sync=full from filter nodes, so I tried to use >> filter-node-show-progress.patch for just test_drive_mirror(), but that >> only works back until fe4f0614ef9e361d (or rather 0f12264e7a4145 if >> you prefer not to get a SIGABRT). Before that commit, it hangs.) >> >> Max > > > Hmm, in show-progress.patch you call "stop" the second time.. It's a > mistake I think.. Ah, oops. Yes, not sure, how that part got in (some rebasing mistake). Still, removing those three duplicated lines (stop + query-block-jobs) yields the same result. (I mean, what else is to be expected; BB throttling does nothing, so even before the first stop, the jobs are READY/COMPLETED.) > Also, on current master x-bps-total I can find only in iotests.. Where > is it defined o_O? If I change it to be bps-total, it fails.. Strange. block/throttle-groups.c defines x- as a THROTTLE_OPT_PREFIX... :/ > I've run the test with your patch with throttling filter (and a bit more > logging).. Interesting. It looks like throttling just don't work > noramlly after stop.. I see that mirror does one 1M request, and it > obviously overflow throttle limit, so during your next 5 seconds it does > nothing (and we see progress of 1M). But immediately after "stop" > command all 16 read requests pending for throttling goes, and then a lot > more read requests (hmm, exactly 31) are issued and not throttled at all > (but goes through throttle node). And then new 16 requests are issued > and throttled. I've looked at throttle_group_co_io_limits_intercept() > and I just don't understand how it works) Hm. So you’re saying only the current set of requests are drained, but no new ones are generated? Perhaps 129 was introduced to check that block jobs don’t run to completion on 'stop'. The commit before it (e62303a437af72141^) makes block jobs pause in bdrv_drain_all(), so they don’t generate more requests. Perhaps we just need to ensure that mirror won’t generate many concurrent requests. Indeed. Setting buf_size=65536 leads to offset reaching 64k after the sleep, and then 128k after the stop. That makes sense now. Now there’s only one problem: That doesn’t work with commit… Then again, the commit 129 uses is an active commit, i.e. just mirror. It looks like we can translate it into a non-active commit, though then we still have no control over the buffer size. But then it only progresses to 2.5 MB, which I guess is fine... I suppose with your async backup series, we should then limit max-workers and max-chunk to the minimum for the backup job? Max
13.01.2021 14:05, Max Reitz wrote: > On 13.01.21 10:53, Vladimir Sementsov-Ogievskiy wrote: >> 12.01.2021 20:44, Max Reitz wrote: >>> Hi, >>> >>> tl;dr: I have some troubles debugging what’s wrong with iotest 129. It wants to check that 'stop' does not drain a block job, but to me it seems like that’s exactly what’s happening with the mirror job. >>> >>> >>> For quite some time, I’ve had 129 disabled in my test branch because it fails all the time for me. Now it came up again in Vladimir’s async backup series, and so I tried my hands at debugging it once again. >>> >>> Recap: 129 writes 128M to some image, then runs a block job from there (while the source drive is supposedly throttled), then issues a stop command, and checks that the job is not drained. I.e., still running. >>> >>> (It checks the “running” state via @busy, which is probably wrong; it should verify that @state == 'running' (which wasn’t available back in 2015), but that’s not really why I’m writing this mail.) >>> >>> Like the last time I tried >>> (https://lists.nongnu.org/archive/html/qemu-block/2019-06/msg00499.html) I can see that block jobs completely ignore BB throttling: If you apply the attachment show-progress.patch, you’ll probably see some progress made while the test waits for five seconds. (Here, on tmpfs, mirror and commit get to READY, and backup completes.) >>> >>> OK, so now that block jobs don’t completely break with filters, you can instead use a filter node on the source (as I tried in the patch referenced above). And to fully fix the test, we’d also replace the @busy == True check by @status == 'running'. That’s the attachment filter-node-show-progress.patch. >>> >>> If I apply that, I can see that now there actually is some throttling. After the time.sleep(5), mirror and commit get to exactly 1 MB, and backup gets to 1.0625 MB. Good. >>> >>> However, after the stop is issued, backup stays at 1.2 MB (good), but mirror and commit progress obviously without throttling to 30, 40, 50 MB, whatever. So it appears to me they are drained by the stop. I.e., precisely what the iotest is trying to prove not to happen. >> >> I don't follow.. Increasing of progress means that jobs are drained? > > I don’t know. It does mean that throttling is ignored for a bit, though. I could imagine that’s because something is being drained. > >>> I plan to continue investigating, but I just wanted to send this mail to see whether perhaps someone has an idea on what’s going on. >>> >>> (My main problem is that bisecting this is hard. AFAIK the throttling applied in master:129 has been broken ever since blockdev throttling was moved to the BB. Even without the “Deal with filters” series, you can use at least mirror sync=full from filter nodes, so I tried to use filter-node-show-progress.patch for just test_drive_mirror(), but that only works back until fe4f0614ef9e361d (or rather 0f12264e7a4145 if you prefer not to get a SIGABRT). Before that commit, it hangs.) >>> >>> Max >> >> >> Hmm, in show-progress.patch you call "stop" the second time.. It's a mistake I think.. > > Ah, oops. Yes, not sure, how that part got in (some rebasing mistake). > > Still, removing those three duplicated lines (stop + query-block-jobs) yields the same result. (I mean, what else is to be expected; BB throttling does nothing, so even before the first stop, the jobs are READY/COMPLETED.) > >> Also, on current master x-bps-total I can find only in iotests.. Where is it defined o_O? If I change it to be bps-total, it fails.. Strange. > > block/throttle-groups.c defines x- as a THROTTLE_OPT_PREFIX... :/ > >> I've run the test with your patch with throttling filter (and a bit more logging).. Interesting. It looks like throttling just don't work noramlly after stop.. I see that mirror does one 1M request, and it obviously overflow throttle limit, so during your next 5 seconds it does nothing (and we see progress of 1M). But immediately after "stop" command all 16 read requests pending for throttling goes, and then a lot more read requests (hmm, exactly 31) are issued and not throttled at all (but goes through throttle node). And then new 16 requests are issued and throttled. I've looked at throttle_group_co_io_limits_intercept() and I just don't understand how it works) > > Hm. So you’re saying only the current set of requests are drained, but no new ones are generated? hmm, what's not generated? New requests are generated.. But some requests are not throttled and I don't understand why. > > Perhaps 129 was introduced to check that block jobs don’t run to completion on 'stop'. The commit before it (e62303a437af72141^) makes block jobs pause in bdrv_drain_all(), so they don’t generate more requests. Perhaps we just need to ensure that mirror won’t generate many concurrent requests. But bdrv_drain_all() is not a "drain_begin", so after it jobs are resumed and new requests may be generated.. > > Indeed. Setting buf_size=65536 leads to offset reaching 64k after the sleep, and then 128k after the stop. That makes sense now. > > Now there’s only one problem: That doesn’t work with commit… > > Then again, the commit 129 uses is an active commit, i.e. just mirror. It looks like we can translate it into a non-active commit, though then we still have no control over the buffer size. But then it only progresses to 2.5 MB, which I guess is fine... > > I suppose with your async backup series, we should then limit max-workers and max-chunk to the minimum for the backup job? > I have a patch in the series which sets the backup speed to 1k and it is enough. Let me summarize what I think about that all: Jobs should continue running and do progress after "stop". So, test should check exactly this behavior. We also want to check that "stop" doesn't force jobs to finish synchronously (but we just doesn't have a synchronous finialization of job mechanism, so I think it's not possible anyway). Test check that jobs are "busy" after stop, which is bad idea as we know. I think, test should do the following: 1. check that after stop progress is less than maximum (which guarantees both that job was not finished prior to "stop", and that "stop" doesn't force synchronous finalization) 1.5 check also that job is in "running" state (not cancelled, for exmaple) 2. then, drop any throttling 3. wait for correct job finish So we check success scenario, which is the main thing. We can still just cancel the job instead of [2,3], like test is already do. Than about throttling: Why we need it? Because without throttling jobs may do their work too fast, and just finish prior to "stop" command. And we see, that block_set_io_throttle definied throttling doesn't work good with block-jobs. Throttling filter works bad as well. We can use "speed" of jobs instead, to throttle exactly block-jobs. And it must work. I'm sure at least, that for current (synchronous) backup, it's "speed" works good, and after my series "speed" in new async backup is smart enough and works good too. So, there is my patch "[PATCH v3 13/25] iotests: 129: prepare for backup over block-copy". So, I can suggest to drop any throttling from the test (as it's not test of throttling) and debug and test throttling in separate. And instead, just use "speed" parameter for all the jobs (like a lot of other tests do).
13.01.2021 16:59, Vladimir Sementsov-Ogievskiy wrote: > 13.01.2021 14:05, Max Reitz wrote: >> On 13.01.21 10:53, Vladimir Sementsov-Ogievskiy wrote: >>> 12.01.2021 20:44, Max Reitz wrote: >>>> Hi, >>>> >>>> tl;dr: I have some troubles debugging what’s wrong with iotest 129. It wants to check that 'stop' does not drain a block job, but to me it seems like that’s exactly what’s happening with the mirror job. >>>> >>>> >>>> For quite some time, I’ve had 129 disabled in my test branch because it fails all the time for me. Now it came up again in Vladimir’s async backup series, and so I tried my hands at debugging it once again. >>>> >>>> Recap: 129 writes 128M to some image, then runs a block job from there (while the source drive is supposedly throttled), then issues a stop command, and checks that the job is not drained. I.e., still running. >>>> >>>> (It checks the “running” state via @busy, which is probably wrong; it should verify that @state == 'running' (which wasn’t available back in 2015), but that’s not really why I’m writing this mail.) >>>> >>>> Like the last time I tried >>>> (https://lists.nongnu.org/archive/html/qemu-block/2019-06/msg00499.html) I can see that block jobs completely ignore BB throttling: If you apply the attachment show-progress.patch, you’ll probably see some progress made while the test waits for five seconds. (Here, on tmpfs, mirror and commit get to READY, and backup completes.) >>>> >>>> OK, so now that block jobs don’t completely break with filters, you can instead use a filter node on the source (as I tried in the patch referenced above). And to fully fix the test, we’d also replace the @busy == True check by @status == 'running'. That’s the attachment filter-node-show-progress.patch. >>>> >>>> If I apply that, I can see that now there actually is some throttling. After the time.sleep(5), mirror and commit get to exactly 1 MB, and backup gets to 1.0625 MB. Good. >>>> >>>> However, after the stop is issued, backup stays at 1.2 MB (good), but mirror and commit progress obviously without throttling to 30, 40, 50 MB, whatever. So it appears to me they are drained by the stop. I.e., precisely what the iotest is trying to prove not to happen. >>> >>> I don't follow.. Increasing of progress means that jobs are drained? >> >> I don’t know. It does mean that throttling is ignored for a bit, though. I could imagine that’s because something is being drained. >> >>>> I plan to continue investigating, but I just wanted to send this mail to see whether perhaps someone has an idea on what’s going on. >>>> >>>> (My main problem is that bisecting this is hard. AFAIK the throttling applied in master:129 has been broken ever since blockdev throttling was moved to the BB. Even without the “Deal with filters” series, you can use at least mirror sync=full from filter nodes, so I tried to use filter-node-show-progress.patch for just test_drive_mirror(), but that only works back until fe4f0614ef9e361d (or rather 0f12264e7a4145 if you prefer not to get a SIGABRT). Before that commit, it hangs.) >>>> >>>> Max >>> >>> >>> Hmm, in show-progress.patch you call "stop" the second time.. It's a mistake I think.. >> >> Ah, oops. Yes, not sure, how that part got in (some rebasing mistake). >> >> Still, removing those three duplicated lines (stop + query-block-jobs) yields the same result. (I mean, what else is to be expected; BB throttling does nothing, so even before the first stop, the jobs are READY/COMPLETED.) >> >>> Also, on current master x-bps-total I can find only in iotests.. Where is it defined o_O? If I change it to be bps-total, it fails.. Strange. >> >> block/throttle-groups.c defines x- as a THROTTLE_OPT_PREFIX... :/ >> >>> I've run the test with your patch with throttling filter (and a bit more logging).. Interesting. It looks like throttling just don't work noramlly after stop.. I see that mirror does one 1M request, and it obviously overflow throttle limit, so during your next 5 seconds it does nothing (and we see progress of 1M). But immediately after "stop" command all 16 read requests pending for throttling goes, and then a lot more read requests (hmm, exactly 31) are issued and not throttled at all (but goes through throttle node). And then new 16 requests are issued and throttled. I've looked at throttle_group_co_io_limits_intercept() and I just don't understand how it works) >> >> Hm. So you’re saying only the current set of requests are drained, but no new ones are generated? > > hmm, what's not generated? New requests are generated.. But some requests are not throttled and I don't understand why. > >> >> Perhaps 129 was introduced to check that block jobs don’t run to completion on 'stop'. The commit before it (e62303a437af72141^) makes block jobs pause in bdrv_drain_all(), so they don’t generate more requests. Perhaps we just need to ensure that mirror won’t generate many concurrent requests. > > But bdrv_drain_all() is not a "drain_begin", so after it jobs are resumed and new requests may be generated.. > >> >> Indeed. Setting buf_size=65536 leads to offset reaching 64k after the sleep, and then 128k after the stop. That makes sense now. >> >> Now there’s only one problem: That doesn’t work with commit… >> >> Then again, the commit 129 uses is an active commit, i.e. just mirror. It looks like we can translate it into a non-active commit, though then we still have no control over the buffer size. But then it only progresses to 2.5 MB, which I guess is fine... >> >> I suppose with your async backup series, we should then limit max-workers and max-chunk to the minimum for the backup job? >> > > I have a patch in the series which sets the backup speed to 1k and it is enough. > > Let me summarize what I think about that all: > > Jobs should continue running and do progress after "stop". So, test should check exactly this behavior. We also want to check that "stop" doesn't force jobs to finish synchronously (but we just doesn't have a synchronous finialization of job mechanism, so I think it's not possible anyway). > > Test check that jobs are "busy" after stop, which is bad idea as we know. I think, test should do the following: > > 1. check that after stop progress is less than maximum (which guarantees both that job was not finished prior to "stop", and that "stop" doesn't force synchronous finalization) > 1.5 check also that job is in "running" state (not cancelled, for exmaple) > 2. then, drop any throttling > 3. wait for correct job finish > > So we check success scenario, which is the main thing. We can still just cancel the job instead of [2,3], like test is already do. Hmm, cancelling is bad idea, as in this way we can miss the bug, when job after "stop" is in running state but does no progress forever. So, if we want to cancel, we at least should check progress twice to be sure that job continues to work. > > Than about throttling: > > Why we need it? Because without throttling jobs may do their work too fast, and just finish prior to "stop" command. > > And we see, that block_set_io_throttle definied throttling doesn't work good with block-jobs. Throttling filter works bad as well. > > We can use "speed" of jobs instead, to throttle exactly block-jobs. And it must work. I'm sure at least, that for current (synchronous) backup, it's "speed" works good, and after my series "speed" in new async backup is smart enough and works good too. So, there is my patch "[PATCH v3 13/25] iotests: 129: prepare for backup over block-copy". > > So, I can suggest to drop any throttling from the test (as it's not test of throttling) and debug and test throttling in separate. And instead, just use "speed" parameter for all the jobs (like a lot of other tests do). >
commit 5b09bd175481925cd60dfa959b3f1f7b7ba4a19c Author: Max Reitz <mreitz@redhat.com> Date: Mon Jan 28 20:41:16 2019 +0100 iotests/129: Use filter node, and show progress diff --git a/tests/qemu-iotests/129 b/tests/qemu-iotests/129 index 0e13244d85..eae77b9da9 100755 --- a/tests/qemu-iotests/129 +++ b/tests/qemu-iotests/129 @@ -32,52 +32,46 @@ class TestStopWithBlockJob(iotests.QMPTestCase): iotests.qemu_img('create', '-f', iotests.imgfmt, self.test_img, "-b", self.base_img, '-F', iotests.imgfmt) iotests.qemu_io('-f', iotests.imgfmt, '-c', 'write -P0x5d 1M 128M', self.test_img) - self.vm = iotests.VM().add_drive(self.test_img) + self.vm = iotests.VM() + self.vm.add_object('throttle-group,id=tg0,x-bps-total=1024') + self.vm.add_drive(None, 'driver=throttle,throttle-group=tg0,file.driver=%s,file.file.filename=%s' % (iotests.imgfmt, self.test_img)) self.vm.launch() def tearDown(self): - params = {"device": "drive0", - "bps": 0, - "bps_rd": 0, - "bps_wr": 0, - "iops": 0, - "iops_rd": 0, - "iops_wr": 0, - } - result = self.vm.qmp("block_set_io_throttle", conv_keys=False, - **params) self.vm.shutdown() def do_test_stop(self, cmd, **args): """Test 'stop' while block job is running on a throttled drive. The 'stop' command shouldn't drain the job""" - params = {"device": "drive0", - "bps": 1024, - "bps_rd": 0, - "bps_wr": 0, - "iops": 0, - "iops_rd": 0, - "iops_wr": 0, - } - result = self.vm.qmp("block_set_io_throttle", conv_keys=False, - **params) - self.assert_qmp(result, 'return', {}) result = self.vm.qmp(cmd, **args) self.assert_qmp(result, 'return', {}) + + print('---') + print('Right after the job was started:') + print(self.vm.qmp("query-block-jobs")) + time.sleep(5) + print('Five seconds later:') + print(self.vm.qmp("query-block-jobs")) + result = self.vm.qmp("stop") self.assert_qmp(result, 'return', {}) result = self.vm.qmp("query-block-jobs") - self.assert_qmp(result, 'return[0]/busy', True) + + print('After stop:') + print(result) + self.assert_qmp(result, 'return[0]/status', 'running') self.assert_qmp(result, 'return[0]/ready', False) + self.vm.qmp("block-job-cancel", device="drive0", force=True) + def test_drive_mirror(self): self.do_test_stop("drive-mirror", device="drive0", - target=self.target_img, + target=self.target_img, format=iotests.imgfmt, sync="full") def test_drive_backup(self): self.do_test_stop("drive-backup", device="drive0", - target=self.target_img, + target=self.target_img, format=iotests.imgfmt, sync="full") def test_block_commit(self):