diff mbox

[v2,1/2] qemu-iotests: reduce chance of races in 185

Message ID 20180508135436.30140-2-stefanha@redhat.com (mailing list archive)
State New, archived
Headers show

Commit Message

Stefan Hajnoczi May 8, 2018, 1:54 p.m. UTC
Commit 8565c3ab537e78f3e69977ec2c609dc9417a806e ("qemu-iotests: fix
185") identified a race condition in a sub-test.

Similar issues also affect the other sub-tests.  If disk I/O completes
quickly, it races with the QMP 'quit' command.  This causes spurious
test failures because QMP events are emitted in an unpredictable order.

This test relies on QEMU internals and there is no QMP API for getting
deterministic behavior needed to make this test 100% reliable.  At the
same time, the test is useful and it would be a shame to remove it.

Add sleep 0.5 to reduce the chance of races.  This is not a real fix but
appears to reduce spurious failures in practice.

Cc: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 tests/qemu-iotests/185 | 12 ++++++++++++
 1 file changed, 12 insertions(+)

Comments

Eric Blake May 8, 2018, 2:26 p.m. UTC | #1
On 05/08/2018 08:54 AM, Stefan Hajnoczi wrote:
> Commit 8565c3ab537e78f3e69977ec2c609dc9417a806e ("qemu-iotests: fix
> 185") identified a race condition in a sub-test.
> 
> Similar issues also affect the other sub-tests.  If disk I/O completes
> quickly, it races with the QMP 'quit' command.  This causes spurious
> test failures because QMP events are emitted in an unpredictable order.
> 
> This test relies on QEMU internals and there is no QMP API for getting
> deterministic behavior needed to make this test 100% reliable.  At the
> same time, the test is useful and it would be a shame to remove it.
> 
> Add sleep 0.5 to reduce the chance of races.  This is not a real fix but
> appears to reduce spurious failures in practice.
> 
> Cc: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
> Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
> ---
>   tests/qemu-iotests/185 | 12 ++++++++++++
>   1 file changed, 12 insertions(+)

I'm not opposed to this patch, but is there any way to write the test to 
take both events in either order, without logging the events as they 
arrive, but instead summarizing in a deterministic order which events 
were received after the fact?  That way, no matter which way the race is 
won, we merely log that we got two expected events, and could avoid the 
extra sleep.
Stefan Hajnoczi May 10, 2018, 10:05 a.m. UTC | #2
On Tue, May 08, 2018 at 09:26:03AM -0500, Eric Blake wrote:
> On 05/08/2018 08:54 AM, Stefan Hajnoczi wrote:
> > Commit 8565c3ab537e78f3e69977ec2c609dc9417a806e ("qemu-iotests: fix
> > 185") identified a race condition in a sub-test.
> > 
> > Similar issues also affect the other sub-tests.  If disk I/O completes
> > quickly, it races with the QMP 'quit' command.  This causes spurious
> > test failures because QMP events are emitted in an unpredictable order.
> > 
> > This test relies on QEMU internals and there is no QMP API for getting
> > deterministic behavior needed to make this test 100% reliable.  At the
> > same time, the test is useful and it would be a shame to remove it.
> > 
> > Add sleep 0.5 to reduce the chance of races.  This is not a real fix but
> > appears to reduce spurious failures in practice.
> > 
> > Cc: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
> > Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
> > ---
> >   tests/qemu-iotests/185 | 12 ++++++++++++
> >   1 file changed, 12 insertions(+)
> 
> I'm not opposed to this patch, but is there any way to write the test to
> take both events in either order, without logging the events as they arrive,
> but instead summarizing in a deterministic order which events were received
> after the fact?  That way, no matter which way the race is won, we merely
> log that we got two expected events, and could avoid the extra sleep.

I don't think there is a practical way of doing that without big changes
to the test.  It could be rewritten in Python to make filtering the QMP
events easier.

Hiding the race doesn't solve the deeper problem though: the test case
doesn't exercise the same code path each time.  The test should really
cover all cancellation points in the block job lifecycle instead of just
one at random.  If we solve this problem then we don't need to filter
the QMP event sequence.

Maybe it can be done with blkdebug.  If not then maybe a blockjobdbg
interface is necessary to perform deterministic tests (eliminating the
need for the ratelimiting trick used by this test!).

Please share ideas, but I think this is a long-term item that shouldn't
block this series.

Stefan
Eric Blake May 10, 2018, 1:24 p.m. UTC | #3
On 05/10/2018 05:05 AM, Stefan Hajnoczi wrote:

>>> Add sleep 0.5 to reduce the chance of races.  This is not a real fix but
>>> appears to reduce spurious failures in practice.
>>>
>>> Cc: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
>>> Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
>>> ---
>>>    tests/qemu-iotests/185 | 12 ++++++++++++
>>>    1 file changed, 12 insertions(+)
>>
>> I'm not opposed to this patch, but is there any way to write the test to
>> take both events in either order, without logging the events as they arrive,
>> but instead summarizing in a deterministic order which events were received
>> after the fact?  That way, no matter which way the race is won, we merely
>> log that we got two expected events, and could avoid the extra sleep.
> 
> I don't think there is a practical way of doing that without big changes
> to the test.  It could be rewritten in Python to make filtering the QMP
> events easier.
> 
> Hiding the race doesn't solve the deeper problem though: the test case
> doesn't exercise the same code path each time.  The test should really
> cover all cancellation points in the block job lifecycle instead of just
> one at random.  If we solve this problem then we don't need to filter
> the QMP event sequence.
> 
> Maybe it can be done with blkdebug.  If not then maybe a blockjobdbg
> interface is necessary to perform deterministic tests (eliminating the
> need for the ratelimiting trick used by this test!).

So trying to restate your question - can blkdebug be used to pause I/O, 
or does it just cause an error?  If the rate limiting is in effect, then 
we expect that the job will only write to the first half of a 
destination, so a blkdebug injected error for writes to the second half 
would either not trigger (the normal cancel won the race) or does 
trigger (the job advanced before the normal cancel, but blkdebug's 
injected error also serves as a means of cancelling the job, so while 
we'd have to filter things, we at least have a deterministic way of 
ending the job before it runs to completion). Except that I'm writing 
that without even re-reading the test in question to see how it would 
all fit in.  It may or may not be worth pursuing.

> 
> Please share ideas, but I think this is a long-term item that shouldn't
> block this series.

I agree that any further improvements don't need to hold up this patch 
from making things at least more reliable, even if not perfect.  So:

Reviewed-by: Eric Blake <eblake@redhat.com>
Vladimir Sementsov-Ogievskiy May 10, 2018, 2:01 p.m. UTC | #4
08.05.2018 16:54, Stefan Hajnoczi wrote:
> Commit 8565c3ab537e78f3e69977ec2c609dc9417a806e ("qemu-iotests: fix
> 185") identified a race condition in a sub-test.
>
> Similar issues also affect the other sub-tests.  If disk I/O completes
> quickly, it races with the QMP 'quit' command.  This causes spurious
> test failures because QMP events are emitted in an unpredictable order.

Hmm, can you give an example? I'm looking at 8565c3ab537. and
it's not similar.

If disk I/O completes quickly, it will have large final offset, and test 
will
fail. And pause of 0.5 will not help.

My case was that quit is handled before the first iteration of mirror.

>
> This test relies on QEMU internals and there is no QMP API for getting
> deterministic behavior needed to make this test 100% reliable.  At the
> same time, the test is useful and it would be a shame to remove it.
>
> Add sleep 0.5 to reduce the chance of races.  This is not a real fix but
> appears to reduce spurious failures in practice.
>
> Cc: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
> Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
> ---
>   tests/qemu-iotests/185 | 12 ++++++++++++
>   1 file changed, 12 insertions(+)
>
> diff --git a/tests/qemu-iotests/185 b/tests/qemu-iotests/185
> index 298d88d04e..975404c99d 100755
> --- a/tests/qemu-iotests/185
> +++ b/tests/qemu-iotests/185
> @@ -118,6 +118,9 @@ _send_qemu_cmd $h \
>                         'speed': 65536 } }" \
>       "return"
>   
> +# If we don't sleep here 'quit' command races with disk I/O
> +sleep 0.5
> +
>   _send_qemu_cmd $h "{ 'execute': 'quit' }" "return"
>   wait=1 _cleanup_qemu
>   
> @@ -137,6 +140,9 @@ _send_qemu_cmd $h \
>                         'speed': 65536 } }" \
>       "return"
>   
> +# If we don't sleep here 'quit' command races with disk I/O
> +sleep 0.5
> +
>   _send_qemu_cmd $h "{ 'execute': 'quit' }" "return"
>   wait=1 _cleanup_qemu
>   
> @@ -183,6 +189,9 @@ _send_qemu_cmd $h \
>                         'speed': 65536 } }" \
>       "return"
>   
> +# If we don't sleep here 'quit' command races with disk I/O
> +sleep 0.5
> +
>   _send_qemu_cmd $h "{ 'execute': 'quit' }" "return"
>   wait=1 _cleanup_qemu
>   
> @@ -201,6 +210,9 @@ _send_qemu_cmd $h \
>                         'speed': 65536 } }" \
>       "return"
>   
> +# If we don't sleep here 'quit' command races with disk I/O
> +sleep 0.5
> +
>   _send_qemu_cmd $h "{ 'execute': 'quit' }" "return"
>   wait=1 _cleanup_qemu
>   
>
diff mbox

Patch

diff --git a/tests/qemu-iotests/185 b/tests/qemu-iotests/185
index 298d88d04e..975404c99d 100755
--- a/tests/qemu-iotests/185
+++ b/tests/qemu-iotests/185
@@ -118,6 +118,9 @@  _send_qemu_cmd $h \
                       'speed': 65536 } }" \
     "return"
 
+# If we don't sleep here 'quit' command races with disk I/O
+sleep 0.5
+
 _send_qemu_cmd $h "{ 'execute': 'quit' }" "return"
 wait=1 _cleanup_qemu
 
@@ -137,6 +140,9 @@  _send_qemu_cmd $h \
                       'speed': 65536 } }" \
     "return"
 
+# If we don't sleep here 'quit' command races with disk I/O
+sleep 0.5
+
 _send_qemu_cmd $h "{ 'execute': 'quit' }" "return"
 wait=1 _cleanup_qemu
 
@@ -183,6 +189,9 @@  _send_qemu_cmd $h \
                       'speed': 65536 } }" \
     "return"
 
+# If we don't sleep here 'quit' command races with disk I/O
+sleep 0.5
+
 _send_qemu_cmd $h "{ 'execute': 'quit' }" "return"
 wait=1 _cleanup_qemu
 
@@ -201,6 +210,9 @@  _send_qemu_cmd $h \
                       'speed': 65536 } }" \
     "return"
 
+# If we don't sleep here 'quit' command races with disk I/O
+sleep 0.5
+
 _send_qemu_cmd $h "{ 'execute': 'quit' }" "return"
 wait=1 _cleanup_qemu