diff mbox series

[v2,1/2] block: zero data data corruption using prealloc-filter

Message ID 20240712094617.565237-2-andrey.drobyshev@virtuozzo.com (mailing list archive)
State New, archived
Headers show
Series Fix data corruption within preallocation | expand

Commit Message

Andrey Drobyshev July 12, 2024, 9:46 a.m. UTC
From: "Denis V. Lunev" <den@openvz.org>

We have observed that some clusters in the QCOW2 files are zeroed
while preallocation filter is used.

We are able to trace down the following sequence when prealloc-filter
is used:
    co=0x55e7cbed7680 qcow2_co_pwritev_task()
    co=0x55e7cbed7680 preallocate_co_pwritev_part()
    co=0x55e7cbed7680 handle_write()
    co=0x55e7cbed7680 bdrv_co_do_pwrite_zeroes()
    co=0x55e7cbed7680 raw_do_pwrite_zeroes()
    co=0x7f9edb7fe500 do_fallocate()

Here coroutine 0x55e7cbed7680 is being blocked waiting while coroutine
0x7f9edb7fe500 will finish with fallocate of the file area. OK. It is
time to handle next coroutine, which
    co=0x55e7cbee91b0 qcow2_co_pwritev_task()
    co=0x55e7cbee91b0 preallocate_co_pwritev_part()
    co=0x55e7cbee91b0 handle_write()
    co=0x55e7cbee91b0 bdrv_co_do_pwrite_zeroes()
    co=0x55e7cbee91b0 raw_do_pwrite_zeroes()
    co=0x7f9edb7deb00 do_fallocate()

The trouble comes here. Coroutine 0x55e7cbed7680 has not advanced
file_end yet and coroutine 0x55e7cbee91b0 will start fallocate() for
the same area. This means that if (once fallocate is started inside
0x7f9edb7deb00) original fallocate could end and the real write will
be executed. In that case write() request is handled at the same time
as fallocate().

The patch moves s->file_lock assignment before fallocate and that is
crucial. The idea is that all subsequent requests into the area
being preallocation will be issued as just writes without fallocate
to this area and they will not proceed thanks to overlapping
requests mechanics. If preallocation will fail, we will just switch
to the normal expand-by-write behavior and that is not a problem
except performance.

Signed-off-by: Denis V. Lunev <den@openvz.org>
Tested-by: Andrey Drobyshev <andrey.drobyshev@virtuozzo.com>
---
 block/preallocate.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

Comments

Vladimir Sementsov-Ogievskiy July 12, 2024, 11:55 a.m. UTC | #1
On 12.07.24 12:46, Andrey Drobyshev wrote:
> From: "Denis V. Lunev" <den@openvz.org>
> 
> We have observed that some clusters in the QCOW2 files are zeroed
> while preallocation filter is used.
> 
> We are able to trace down the following sequence when prealloc-filter
> is used:
>      co=0x55e7cbed7680 qcow2_co_pwritev_task()
>      co=0x55e7cbed7680 preallocate_co_pwritev_part()
>      co=0x55e7cbed7680 handle_write()
>      co=0x55e7cbed7680 bdrv_co_do_pwrite_zeroes()
>      co=0x55e7cbed7680 raw_do_pwrite_zeroes()
>      co=0x7f9edb7fe500 do_fallocate()
> 
> Here coroutine 0x55e7cbed7680 is being blocked waiting while coroutine
> 0x7f9edb7fe500 will finish with fallocate of the file area. OK. It is
> time to handle next coroutine, which
>      co=0x55e7cbee91b0 qcow2_co_pwritev_task()
>      co=0x55e7cbee91b0 preallocate_co_pwritev_part()
>      co=0x55e7cbee91b0 handle_write()
>      co=0x55e7cbee91b0 bdrv_co_do_pwrite_zeroes()
>      co=0x55e7cbee91b0 raw_do_pwrite_zeroes()
>      co=0x7f9edb7deb00 do_fallocate()
> 
> The trouble comes here. Coroutine 0x55e7cbed7680 has not advanced
> file_end yet and coroutine 0x55e7cbee91b0 will start fallocate() for
> the same area. This means that if (once fallocate is started inside
> 0x7f9edb7deb00) original fallocate could end and the real write will
> be executed. In that case write() request is handled at the same time
> as fallocate().
> 
> The patch moves s->file_lock assignment before fallocate and that is

text need to be updated

> crucial. The idea is that all subsequent requests into the area
> being preallocation will be issued as just writes without fallocate
> to this area and they will not proceed thanks to overlapping
> requests mechanics. If preallocation will fail, we will just switch
> to the normal expand-by-write behavior and that is not a problem
> except performance.
> 
> Signed-off-by: Denis V. Lunev <den@openvz.org>
> Tested-by: Andrey Drobyshev <andrey.drobyshev@virtuozzo.com>
> ---
>   block/preallocate.c | 8 +++++++-
>   1 file changed, 7 insertions(+), 1 deletion(-)
> 
> diff --git a/block/preallocate.c b/block/preallocate.c
> index d215bc5d6d..ecf0aa4baa 100644
> --- a/block/preallocate.c
> +++ b/block/preallocate.c
> @@ -383,6 +383,13 @@ handle_write(BlockDriverState *bs, int64_t offset, int64_t bytes,
>   
>       want_merge_zero = want_merge_zero && (prealloc_start <= offset);
>   
> +    /*
> +     * Assign file_end before making actual preallocation. This will ensure
> +     * that next request performed while preallocation is in progress will
> +     * be passed without preallocation.
> +     */
> +    s->file_end = prealloc_end;
> +
>       ret = bdrv_co_pwrite_zeroes(
>               bs->file, prealloc_start, prealloc_end - prealloc_start,
>               BDRV_REQ_NO_FALLBACK | BDRV_REQ_SERIALISING | BDRV_REQ_NO_WAIT);
> @@ -391,7 +398,6 @@ handle_write(BlockDriverState *bs, int64_t offset, int64_t bytes,
>           return false;
>       }
>   
> -    s->file_end = prealloc_end;
>       return want_merge_zero;
>   }
>   


Hmm. But this way we set both s->file_end and s->zero_start prior to actual write_zero operation. This means that next write-zero operation may go fast-path (see preallocate_co_pwrite_zeroes()) and return success, even before actual finish of preallocation write_zeroes operation (which may also fail). Seems we need to update logic around s->zero_start too.
Denis V. Lunev July 16, 2024, 1:32 p.m. UTC | #2
On 7/12/24 13:55, Vladimir Sementsov-Ogievskiy wrote:
> On 12.07.24 12:46, Andrey Drobyshev wrote:
>> From: "Denis V. Lunev" <den@openvz.org>
>>
>> We have observed that some clusters in the QCOW2 files are zeroed
>> while preallocation filter is used.
>>
>> We are able to trace down the following sequence when prealloc-filter
>> is used:
>>      co=0x55e7cbed7680 qcow2_co_pwritev_task()
>>      co=0x55e7cbed7680 preallocate_co_pwritev_part()
>>      co=0x55e7cbed7680 handle_write()
>>      co=0x55e7cbed7680 bdrv_co_do_pwrite_zeroes()
>>      co=0x55e7cbed7680 raw_do_pwrite_zeroes()
>>      co=0x7f9edb7fe500 do_fallocate()
>>
>> Here coroutine 0x55e7cbed7680 is being blocked waiting while coroutine
>> 0x7f9edb7fe500 will finish with fallocate of the file area. OK. It is
>> time to handle next coroutine, which
>>      co=0x55e7cbee91b0 qcow2_co_pwritev_task()
>>      co=0x55e7cbee91b0 preallocate_co_pwritev_part()
>>      co=0x55e7cbee91b0 handle_write()
>>      co=0x55e7cbee91b0 bdrv_co_do_pwrite_zeroes()
>>      co=0x55e7cbee91b0 raw_do_pwrite_zeroes()
>>      co=0x7f9edb7deb00 do_fallocate()
>>
>> The trouble comes here. Coroutine 0x55e7cbed7680 has not advanced
>> file_end yet and coroutine 0x55e7cbee91b0 will start fallocate() for
>> the same area. This means that if (once fallocate is started inside
>> 0x7f9edb7deb00) original fallocate could end and the real write will
>> be executed. In that case write() request is handled at the same time
>> as fallocate().
>>
>> The patch moves s->file_lock assignment before fallocate and that is
>
> text need to be updated
>
>> crucial. The idea is that all subsequent requests into the area
>> being preallocation will be issued as just writes without fallocate
>> to this area and they will not proceed thanks to overlapping
>> requests mechanics. If preallocation will fail, we will just switch
>> to the normal expand-by-write behavior and that is not a problem
>> except performance.
>>
>> Signed-off-by: Denis V. Lunev <den@openvz.org>
>> Tested-by: Andrey Drobyshev <andrey.drobyshev@virtuozzo.com>
>> ---
>>   block/preallocate.c | 8 +++++++-
>>   1 file changed, 7 insertions(+), 1 deletion(-)
>>
>> diff --git a/block/preallocate.c b/block/preallocate.c
>> index d215bc5d6d..ecf0aa4baa 100644
>> --- a/block/preallocate.c
>> +++ b/block/preallocate.c
>> @@ -383,6 +383,13 @@ handle_write(BlockDriverState *bs, int64_t 
>> offset, int64_t bytes,
>>         want_merge_zero = want_merge_zero && (prealloc_start <= offset);
>>   +    /*
>> +     * Assign file_end before making actual preallocation. This will 
>> ensure
>> +     * that next request performed while preallocation is in 
>> progress will
>> +     * be passed without preallocation.
>> +     */
>> +    s->file_end = prealloc_end;
>> +
>>       ret = bdrv_co_pwrite_zeroes(
>>               bs->file, prealloc_start, prealloc_end - prealloc_start,
>>               BDRV_REQ_NO_FALLBACK | BDRV_REQ_SERIALISING | 
>> BDRV_REQ_NO_WAIT);
>> @@ -391,7 +398,6 @@ handle_write(BlockDriverState *bs, int64_t 
>> offset, int64_t bytes,
>>           return false;
>>       }
>>   -    s->file_end = prealloc_end;
>>       return want_merge_zero;
>>   }
>
>
> Hmm. But this way we set both s->file_end and s->zero_start prior to 
> actual write_zero operation. This means that next write-zero operation 
> may go fast-path (see preallocate_co_pwrite_zeroes()) and return 
> success, even before actual finish of preallocation write_zeroes 
> operation (which may also fail). Seems we need to update logic around 
> s->zero_start too.
>
Yes. This is not a problem at all. We go fast path and this new
fast-pathed write request will stuck on overlapped request check.
This if fine on success path.

But error path is a trickier question.

iris ~/src/qemu $ cat 1.c
#include <stdio.h>
#include <unistd.h>
#include <string.h>
#include <fcntl.h>

int main()
{
     int fd = open("file", O_RDWR | O_CREAT);
     char buf[4096];

     memset(buf, 'a', sizeof(buf));
     pwrite(fd, buf, sizeof(buf), 4096);

     return 0;
}
iris ~/src/qemu $

This works just fine, thus error path would be also fine.

Den
Andrey Drobyshev July 16, 2024, 2:27 p.m. UTC | #3
On 7/16/24 4:32 PM, Denis V. Lunev wrote:
> On 7/12/24 13:55, Vladimir Sementsov-Ogievskiy wrote:
>> On 12.07.24 12:46, Andrey Drobyshev wrote:
>>> From: "Denis V. Lunev" <den@openvz.org>
>>>
>>> We have observed that some clusters in the QCOW2 files are zeroed
>>> while preallocation filter is used.
>>>
>>> We are able to trace down the following sequence when prealloc-filter
>>> is used:
>>>      co=0x55e7cbed7680 qcow2_co_pwritev_task()
>>>      co=0x55e7cbed7680 preallocate_co_pwritev_part()
>>>      co=0x55e7cbed7680 handle_write()
>>>      co=0x55e7cbed7680 bdrv_co_do_pwrite_zeroes()
>>>      co=0x55e7cbed7680 raw_do_pwrite_zeroes()
>>>      co=0x7f9edb7fe500 do_fallocate()
>>>
>>> Here coroutine 0x55e7cbed7680 is being blocked waiting while coroutine
>>> 0x7f9edb7fe500 will finish with fallocate of the file area. OK. It is
>>> time to handle next coroutine, which
>>>      co=0x55e7cbee91b0 qcow2_co_pwritev_task()
>>>      co=0x55e7cbee91b0 preallocate_co_pwritev_part()
>>>      co=0x55e7cbee91b0 handle_write()
>>>      co=0x55e7cbee91b0 bdrv_co_do_pwrite_zeroes()
>>>      co=0x55e7cbee91b0 raw_do_pwrite_zeroes()
>>>      co=0x7f9edb7deb00 do_fallocate()
>>>
>>> The trouble comes here. Coroutine 0x55e7cbed7680 has not advanced
>>> file_end yet and coroutine 0x55e7cbee91b0 will start fallocate() for
>>> the same area. This means that if (once fallocate is started inside
>>> 0x7f9edb7deb00) original fallocate could end and the real write will
>>> be executed. In that case write() request is handled at the same time
>>> as fallocate().
>>>
>>> The patch moves s->file_lock assignment before fallocate and that is
>>
>> text need to be updated
>>
>>> crucial. The idea is that all subsequent requests into the area
>>> being preallocation will be issued as just writes without fallocate
>>> to this area and they will not proceed thanks to overlapping
>>> requests mechanics. If preallocation will fail, we will just switch
>>> to the normal expand-by-write behavior and that is not a problem
>>> except performance.
>>>
>>> Signed-off-by: Denis V. Lunev <den@openvz.org>
>>> Tested-by: Andrey Drobyshev <andrey.drobyshev@virtuozzo.com>
>>> ---
>>>   block/preallocate.c | 8 +++++++-
>>>   1 file changed, 7 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/block/preallocate.c b/block/preallocate.c
>>> index d215bc5d6d..ecf0aa4baa 100644
>>> --- a/block/preallocate.c
>>> +++ b/block/preallocate.c
>>> @@ -383,6 +383,13 @@ handle_write(BlockDriverState *bs, int64_t
>>> offset, int64_t bytes,
>>>         want_merge_zero = want_merge_zero && (prealloc_start <= offset);
>>>   +    /*
>>> +     * Assign file_end before making actual preallocation. This will
>>> ensure
>>> +     * that next request performed while preallocation is in
>>> progress will
>>> +     * be passed without preallocation.
>>> +     */
>>> +    s->file_end = prealloc_end;
>>> +
>>>       ret = bdrv_co_pwrite_zeroes(
>>>               bs->file, prealloc_start, prealloc_end - prealloc_start,
>>>               BDRV_REQ_NO_FALLBACK | BDRV_REQ_SERIALISING |
>>> BDRV_REQ_NO_WAIT);
>>> @@ -391,7 +398,6 @@ handle_write(BlockDriverState *bs, int64_t
>>> offset, int64_t bytes,
>>>           return false;
>>>       }
>>>   -    s->file_end = prealloc_end;
>>>       return want_merge_zero;
>>>   }
>>
>>
>> Hmm. But this way we set both s->file_end and s->zero_start prior to
>> actual write_zero operation. This means that next write-zero operation
>> may go fast-path (see preallocate_co_pwrite_zeroes()) and return
>> success, even before actual finish of preallocation write_zeroes
>> operation (which may also fail). Seems we need to update logic around
>> s->zero_start too.
>>
> Yes. This is not a problem at all. We go fast path and this new
> fast-pathed write request will stuck on overlapped request check.
> This if fine on success path.
> 
> But error path is a trickier question.
> 
> iris ~/src/qemu $ cat 1.c
> #include <stdio.h>
> #include <unistd.h>
> #include <string.h>
> #include <fcntl.h>
> 
> int main()
> {
>     int fd = open("file", O_RDWR | O_CREAT);
>     char buf[4096];
> 
>     memset(buf, 'a', sizeof(buf));
>     pwrite(fd, buf, sizeof(buf), 4096);
> 
>     return 0;
> }
> iris ~/src/qemu $
> 
> This works just fine, thus error path would be also fine.
> 
> Den

This would also be relatively easy to check by modifying our original
test case in 298 so that half of aio_write requests write actual data,
and other half cause write_zeroes operations.  It doesn't seem to fail
with this v2 patch.

I'll modify the testcase accordingly in v3.
Vladimir Sementsov-Ogievskiy Aug. 2, 2024, 3:31 p.m. UTC | #4
On 16.07.24 16:32, Denis V. Lunev wrote:
> On 7/12/24 13:55, Vladimir Sementsov-Ogievskiy wrote:
>> On 12.07.24 12:46, Andrey Drobyshev wrote:
>>> From: "Denis V. Lunev" <den@openvz.org>
>>>
>>> We have observed that some clusters in the QCOW2 files are zeroed
>>> while preallocation filter is used.
>>>
>>> We are able to trace down the following sequence when prealloc-filter
>>> is used:
>>>      co=0x55e7cbed7680 qcow2_co_pwritev_task()
>>>      co=0x55e7cbed7680 preallocate_co_pwritev_part()
>>>      co=0x55e7cbed7680 handle_write()
>>>      co=0x55e7cbed7680 bdrv_co_do_pwrite_zeroes()
>>>      co=0x55e7cbed7680 raw_do_pwrite_zeroes()
>>>      co=0x7f9edb7fe500 do_fallocate()
>>>
>>> Here coroutine 0x55e7cbed7680 is being blocked waiting while coroutine
>>> 0x7f9edb7fe500 will finish with fallocate of the file area. OK. It is
>>> time to handle next coroutine, which
>>>      co=0x55e7cbee91b0 qcow2_co_pwritev_task()
>>>      co=0x55e7cbee91b0 preallocate_co_pwritev_part()
>>>      co=0x55e7cbee91b0 handle_write()
>>>      co=0x55e7cbee91b0 bdrv_co_do_pwrite_zeroes()
>>>      co=0x55e7cbee91b0 raw_do_pwrite_zeroes()
>>>      co=0x7f9edb7deb00 do_fallocate()
>>>
>>> The trouble comes here. Coroutine 0x55e7cbed7680 has not advanced
>>> file_end yet and coroutine 0x55e7cbee91b0 will start fallocate() for
>>> the same area. This means that if (once fallocate is started inside
>>> 0x7f9edb7deb00) original fallocate could end and the real write will
>>> be executed. In that case write() request is handled at the same time
>>> as fallocate().
>>>
>>> The patch moves s->file_lock assignment before fallocate and that is
>>
>> text need to be updated
>>
>>> crucial. The idea is that all subsequent requests into the area
>>> being preallocation will be issued as just writes without fallocate
>>> to this area and they will not proceed thanks to overlapping
>>> requests mechanics. If preallocation will fail, we will just switch
>>> to the normal expand-by-write behavior and that is not a problem
>>> except performance.
>>>
>>> Signed-off-by: Denis V. Lunev <den@openvz.org>
>>> Tested-by: Andrey Drobyshev <andrey.drobyshev@virtuozzo.com>
>>> ---
>>>   block/preallocate.c | 8 +++++++-
>>>   1 file changed, 7 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/block/preallocate.c b/block/preallocate.c
>>> index d215bc5d6d..ecf0aa4baa 100644
>>> --- a/block/preallocate.c
>>> +++ b/block/preallocate.c
>>> @@ -383,6 +383,13 @@ handle_write(BlockDriverState *bs, int64_t offset, int64_t bytes,
>>>         want_merge_zero = want_merge_zero && (prealloc_start <= offset);
>>>   +    /*
>>> +     * Assign file_end before making actual preallocation. This will ensure
>>> +     * that next request performed while preallocation is in progress will
>>> +     * be passed without preallocation.
>>> +     */
>>> +    s->file_end = prealloc_end;
>>> +
>>>       ret = bdrv_co_pwrite_zeroes(
>>>               bs->file, prealloc_start, prealloc_end - prealloc_start,
>>>               BDRV_REQ_NO_FALLBACK | BDRV_REQ_SERIALISING | BDRV_REQ_NO_WAIT);
>>> @@ -391,7 +398,6 @@ handle_write(BlockDriverState *bs, int64_t offset, int64_t bytes,
>>>           return false;
>>>       }
>>>   -    s->file_end = prealloc_end;
>>>       return want_merge_zero;
>>>   }
>>
>>
>> Hmm. But this way we set both s->file_end and s->zero_start prior to actual write_zero operation. This means that next write-zero operation may go fast-path (see preallocate_co_pwrite_zeroes()) and return success, even before actual finish of preallocation write_zeroes operation (which may also fail). Seems we need to update logic around s->zero_start too.
>>
> Yes. This is not a problem at all. We go fast path and this new
> fast-pathed write request will stuck on overlapped request check.
> This if fine on success path.

Sorry for long delay. By fast-path I meant when we return true from

     /* Now s->data_end, s->zero_start and s->file_end are valid. */
                                                                                  
     if (end <= s->file_end) {
         /* No preallocation needed. */
         return want_merge_zero && offset >= s->zero_start;
     }

and then, preallocate_co_pwrite_zeroes() just reports success immediately. So, we don't stuck in any overlap-check.

Probably, that's not too bad: we just skip the write-zero after EOF. If user read from this area, zeroes should be returned anyway.

Still, if we do so we mix our preallocation (which we truncate at the end) with user written zeroes. So we may truncate user-written zeroes in the end. How much is it bad it's a question, but it was not planned in original code.

> 
> But error path is a trickier question.
> 
> iris ~/src/qemu $ cat 1.c
> #include <stdio.h>
> #include <unistd.h>
> #include <string.h>
> #include <fcntl.h>
> 
> int main()
> {
>      int fd = open("file", O_RDWR | O_CREAT);
>      char buf[4096];
> 
>      memset(buf, 'a', sizeof(buf));
>      pwrite(fd, buf, sizeof(buf), 4096);
> 
>      return 0;
> }
> iris ~/src/qemu $
> 
> This works just fine, thus error path would be also fine.
> 
> Den
diff mbox series

Patch

diff --git a/block/preallocate.c b/block/preallocate.c
index d215bc5d6d..ecf0aa4baa 100644
--- a/block/preallocate.c
+++ b/block/preallocate.c
@@ -383,6 +383,13 @@  handle_write(BlockDriverState *bs, int64_t offset, int64_t bytes,
 
     want_merge_zero = want_merge_zero && (prealloc_start <= offset);
 
+    /*
+     * Assign file_end before making actual preallocation. This will ensure
+     * that next request performed while preallocation is in progress will
+     * be passed without preallocation.
+     */
+    s->file_end = prealloc_end;
+
     ret = bdrv_co_pwrite_zeroes(
             bs->file, prealloc_start, prealloc_end - prealloc_start,
             BDRV_REQ_NO_FALLBACK | BDRV_REQ_SERIALISING | BDRV_REQ_NO_WAIT);
@@ -391,7 +398,6 @@  handle_write(BlockDriverState *bs, int64_t offset, int64_t bytes,
         return false;
     }
 
-    s->file_end = prealloc_end;
     return want_merge_zero;
 }