diff mbox series

block/file-posix: Consider discard flag when opening

Message ID 20240618212457.714456-1-nsoffer@redhat.com (mailing list archive)
State New
Headers show
Series block/file-posix: Consider discard flag when opening | expand

Commit Message

Nir Soffer June 18, 2024, 9:24 p.m. UTC
Set has_discard only when BDRV_O_UNMAP is not set. With this users that
want to keep their images fully allocated can disable hole punching
when writing zeros or discarding using:

   -drive file=thick.img,discard=off

This change is not entirely correct since it changes the default discard
behavior.  Previously we always allowed punching holes, but now you have
must use discard=unmap|on to enable it. We probably need to add the
BDDR_O_UNMAP flag by default.

make check still works, so maybe we don't have tests for sparsifying
images, or maybe you need to run special tests that do not run by
default. We needs tests for keeping images non-sparse.

Signed-off-by: Nir Soffer <nsoffer@redhat.com>
---
 block/file-posix.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Comments

Kevin Wolf June 19, 2024, 8:16 a.m. UTC | #1
Am 18.06.2024 um 23:24 hat Nir Soffer geschrieben:
> Set has_discard only when BDRV_O_UNMAP is not set. With this users that
> want to keep their images fully allocated can disable hole punching
> when writing zeros or discarding using:
> 
>    -drive file=thick.img,discard=off
> 
> This change is not entirely correct since it changes the default discard
> behavior.  Previously we always allowed punching holes, but now you have
> must use discard=unmap|on to enable it. We probably need to add the
> BDDR_O_UNMAP flag by default.
> 
> make check still works, so maybe we don't have tests for sparsifying
> images, or maybe you need to run special tests that do not run by
> default. We needs tests for keeping images non-sparse.
> 
> Signed-off-by: Nir Soffer <nsoffer@redhat.com>

So first of all, I agree with you that this patch is wrong. ;-)

At first, I failed to understand the problem this is trying to solve. I
put a debug message in handle_aiocb_discard() and tried with which
options it triggers. [1] To me, this looked exactly like it should be.
We only try to discard blocks when discard=unmap is given as an option.

That leaves the case of write_zeroes. And while at the first sight, the
code looked good, we do seem to have a problem there and it tried to
unmap even with discard=off.

>  block/file-posix.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/block/file-posix.c b/block/file-posix.c
> index be25e35ff6..acac2abadc 100644
> --- a/block/file-posix.c
> +++ b/block/file-posix.c
> @@ -738,11 +738,11 @@ static int raw_open_common(BlockDriverState *bs, QDict *options,
>          ret = -EINVAL;
>          goto fail;
>      }
>  #endif /* !defined(CONFIG_LINUX_IO_URING) */
>  
> -    s->has_discard = true;
> +    s->has_discard = !!(bdrv_flags & BDRV_O_UNMAP);
>      s->has_write_zeroes = true;
>  
>      if (fstat(s->fd, &st) < 0) {
>          ret = -errno;
>          error_setg_errno(errp, errno, "Could not stat file");

s->has_discard is about what the host supports, not about the semantics
of the QEMU block node. So this doesn't feel right to me.

So for the buggy case, write_zeroes, bdrv_co_pwrite_zeroes() has code
that considers the case and clears the ~BDRV_REQ_MAY_UNMAP flags:

    if (!(child->bs->open_flags & BDRV_O_UNMAP)) {
        flags &= ~BDRV_REQ_MAY_UNMAP;
    }

But it turns out that we don't necessarily even go through this function
for the top node which has discard=off, so it can't take effect:

(gdb) bt
#0  0x00007ffff4f2f144 in __pthread_kill_implementation () at /lib64/libc.so.6
#1  0x00007ffff4ed765e in raise () at /lib64/libc.so.6
#2  0x00007ffff4ebf902 in abort () at /lib64/libc.so.6
#3  0x000055555615aff0 in raw_do_pwrite_zeroes (bs=0x555557f4bcf0, offset=0, bytes=1048576, flags=BDRV_REQ_MAY_UNMAP, blkdev=false) at ../block/file-posix.c:3643
#4  0x000055555615557e in raw_co_pwrite_zeroes (bs=0x555557f4bcf0, offset=0, bytes=1048576, flags=BDRV_REQ_MAY_UNMAP) at ../block/file-posix.c:3655
#5  0x00005555560cde2a in bdrv_co_do_pwrite_zeroes (bs=0x555557f4bcf0, offset=0, bytes=1048576, flags=6) at ../block/io.c:1901
#6  0x00005555560c72f9 in bdrv_aligned_pwritev (child=0x555557f51460, req=0x7fffed5ff800, offset=0, bytes=1048576, align=1, qiov=0x0, qiov_offset=0, flags=6) at ../block/io.c:2100
#7  0x00005555560c6b41 in bdrv_co_do_zero_pwritev (child=0x555557f51460, offset=0, bytes=1048576, flags=6, req=0x7fffed5ff800) at ../block/io.c:2183
#8  0x00005555560c6647 in bdrv_co_pwritev_part (child=0x555557f51460, offset=0, bytes=1048576, qiov=0x0, qiov_offset=0, flags=6) at ../block/io.c:2283
#9  0x00005555560c634f in bdrv_co_pwritev (child=0x555557f51460, offset=0, bytes=1048576, qiov=0x0, flags=6) at ../block/io.c:2216
#10 0x00005555560c75b5 in bdrv_co_pwrite_zeroes (child=0x555557f51460, offset=0, bytes=1048576, flags=BDRV_REQ_MAY_UNMAP) at ../block/io.c:2322
#11 0x0000555556117d24 in raw_co_pwrite_zeroes (bs=0x555557f44980, offset=0, bytes=1048576, flags=BDRV_REQ_MAY_UNMAP) at ../block/raw-format.c:307
#12 0x00005555560cde2a in bdrv_co_do_pwrite_zeroes (bs=0x555557f44980, offset=0, bytes=1048576, flags=6) at ../block/io.c:1901
#13 0x00005555560c72f9 in bdrv_aligned_pwritev (child=0x555557f513f0, req=0x7fffed5ffd90, offset=0, bytes=1048576, align=1, qiov=0x0, qiov_offset=0, flags=6) at ../block/io.c:2100
#14 0x00005555560c6b41 in bdrv_co_do_zero_pwritev (child=0x555557f513f0, offset=0, bytes=1048576, flags=6, req=0x7fffed5ffd90) at ../block/io.c:2183
#15 0x00005555560c6647 in bdrv_co_pwritev_part (child=0x555557f513f0, offset=0, bytes=1048576, qiov=0x0, qiov_offset=0, flags=6) at ../block/io.c:2283
#16 0x00005555560ad741 in blk_co_do_pwritev_part (blk=0x555557f51660, offset=0, bytes=1048576, qiov=0x0, qiov_offset=0, flags=6) at ../block/block-backend.c:1425
#17 0x00005555560ad5f2 in blk_co_pwritev_part (blk=0x555557f51660, offset=0, bytes=1048576, qiov=0x0, qiov_offset=0, flags=6) at ../block/block-backend.c:1440
#18 0x00005555560ad8cf in blk_co_pwritev (blk=0x555557f51660, offset=0, bytes=1048576, qiov=0x0, flags=6) at ../block/block-backend.c:1462
#19 0x00005555560b0f79 in blk_co_pwrite_zeroes (blk=0x555557f51660, offset=0, bytes=1048576, flags=6) at ../block/block-backend.c:2590
#20 0x000055555606d240 in blk_co_pwrite_zeroes_entry (opaque=0x7fffffffbc18) at block/block-gen.c:2162
#21 0x00005555562c36ba in coroutine_trampoline (i0=1475685216, i1=21845) at ../util/coroutine-ucontext.c:175
#22 0x00007ffff4ef1190 in ??? () at /lib64/libc.so.6
#23 0x00007fffffffc010 in ??? ()
#24 0x0000000000000000 in ??? ()

I haven't checked the details yet, but my first impression is that this
check should probably move to bdrv_co_do_pwrite_zeroes().

Kevin


[1] Tests I did:

# For discard

$ echo -e 'qemu-io none0 "discard 0 1M"\nquit' | ./qemu-system-x86_64 -drive if=none,file=/tmp/test.raw,format=raw -monitor stdio
QEMU 9.0.50 monitor - type 'help' for more information
(qemu) qemu-io none0 "discard 0 1M"
discard 1048576/1048576 bytes at offset 0
1 MiB, 1 ops; 00.00 sec (396.171 GiB/sec and 405679.5132 ops/sec)
(qemu) quit

$ echo -e 'qemu-io none0 "discard 0 1M"\nquit' | ./qemu-system-x86_64 -drive if=none,file=/tmp/test.raw,format=raw,discard=unmap -monitor stdio
QEMU 9.0.50 monitor - type 'help' for more information
(qemu) qemu-io none0 "discard 0 1M"
handle_aiocb_discard
discard 1048576/1048576 bytes at offset 0
1 MiB, 1 ops; 00.00 sec (6.623 GiB/sec and 6782.2820 ops/sec)
(qemu) quit

$ echo -e 'qemu-io none0 "discard 0 1M"\nquit' | ./qemu-system-x86_64 -drive if=none,file=/tmp/test.raw,format=raw,discard=off -monitor stdio
QEMU 9.0.50 monitor - type 'help' for more information
(qemu) qemu-io none0 "discard 0 1M"
discard 1048576/1048576 bytes at offset 0
1 MiB, 1 ops; 00.00 sec (375.168 GiB/sec and 384172.1091 ops/sec)
(qemu) quit

# For write_zeroes

$ echo -e 'qemu-io none0 "write -zu 0 1M"\nquit' | ./qemu-system-x86_64 -drive if=none,file=/tmp/test.raw,format=raw,discard=unmap -monitor stdio
QEMU 9.0.50 monitor - type 'help' for more information
(qemu) qemu-io none0 "write -zu 0 1M"
handle_aiocb_write_zeroes_unmap
wrote 1048576/1048576 bytes at offset 0
1 MiB, 1 ops; 00.00 sec (4.943 GiB/sec and 5061.5997 ops/sec)
(qemu) quit
$ echo -e 'qemu-io none0 "write -zu 0 1M"\nquit' | ./qemu-system-x86_64 -drive if=none,file=/tmp/test.raw,format=raw,discard=off -monitor stdio
QEMU 9.0.50 monitor - type 'help' for more information
(qemu) qemu-io none0 "write -zu 0 1M"
handle_aiocb_write_zeroes_unmap
wrote 1048576/1048576 bytes at offset 0
1 MiB, 1 ops; 00.00 sec (7.208 GiB/sec and 7381.4357 ops/sec)
(qemu) quit
Nir Soffer June 19, 2024, 10:58 a.m. UTC | #2
> On 19 Jun 2024, at 11:16, Kevin Wolf <kwolf@redhat.com> wrote:
> 
> Am 18.06.2024 um 23:24 hat Nir Soffer geschrieben:
>> Set has_discard only when BDRV_O_UNMAP is not set. With this users that
>> want to keep their images fully allocated can disable hole punching
>> when writing zeros or discarding using:
>> 
>>   -drive file=thick.img,discard=off
>> 
>> This change is not entirely correct since it changes the default discard
>> behavior.  Previously we always allowed punching holes, but now you have
>> must use discard=unmap|on to enable it. We probably need to add the
>> BDDR_O_UNMAP flag by default.
>> 
>> make check still works, so maybe we don't have tests for sparsifying
>> images, or maybe you need to run special tests that do not run by
>> default. We needs tests for keeping images non-sparse.
>> 
>> Signed-off-by: Nir Soffer <nsoffer@redhat.com>
> 
> So first of all, I agree with you that this patch is wrong. ;-)
> 
> At first, I failed to understand the problem this is trying to solve. I
> put a debug message in handle_aiocb_discard() and tried with which
> options it triggers. [1] To me, this looked exactly like it should be.
> We only try to discard blocks when discard=unmap is given as an option.
> 
> That leaves the case of write_zeroes. And while at the first sight, the
> code looked good, we do seem to have a problem there and it tried to
> unmap even with discard=off.
> 
>> block/file-posix.c | 2 +-
>> 1 file changed, 1 insertion(+), 1 deletion(-)
>> 
>> diff --git a/block/file-posix.c b/block/file-posix.c
>> index be25e35ff6..acac2abadc 100644
>> --- a/block/file-posix.c
>> +++ b/block/file-posix.c
>> @@ -738,11 +738,11 @@ static int raw_open_common(BlockDriverState *bs, QDict *options,
>>         ret = -EINVAL;
>>         goto fail;
>>     }
>> #endif /* !defined(CONFIG_LINUX_IO_URING) */
>> 
>> -    s->has_discard = true;
>> +    s->has_discard = !!(bdrv_flags & BDRV_O_UNMAP);
>>     s->has_write_zeroes = true;
>> 
>>     if (fstat(s->fd, &st) < 0) {
>>         ret = -errno;
>>         error_setg_errno(errp, errno, "Could not stat file");
> 
> s->has_discard is about what the host supports, not about the semantics
> of the QEMU block node. So this doesn't feel right to me.
> 
> So for the buggy case, write_zeroes, bdrv_co_pwrite_zeroes() has code
> that considers the case and clears the ~BDRV_REQ_MAY_UNMAP flags:
> 
>    if (!(child->bs->open_flags & BDRV_O_UNMAP)) {
>        flags &= ~BDRV_REQ_MAY_UNMAP;
>    }
> 
> But it turns out that we don't necessarily even go through this function
> for the top node which has discard=off, so it can't take effect:
> 
> (gdb) bt
> #0  0x00007ffff4f2f144 in __pthread_kill_implementation () at /lib64/libc.so <http://libc.so/>.6
> #1  0x00007ffff4ed765e in raise () at /lib64/libc.so <http://libc.so/>.6
> #2  0x00007ffff4ebf902 in abort () at /lib64/libc.so <http://libc.so/>.6
> #3  0x000055555615aff0 in raw_do_pwrite_zeroes (bs=0x555557f4bcf0, offset=0, bytes=1048576, flags=BDRV_REQ_MAY_UNMAP, blkdev=false) at ../block/file-posix.c:3643
> #4  0x000055555615557e in raw_co_pwrite_zeroes (bs=0x555557f4bcf0, offset=0, bytes=1048576, flags=BDRV_REQ_MAY_UNMAP) at ../block/file-posix.c:3655
> #5  0x00005555560cde2a in bdrv_co_do_pwrite_zeroes (bs=0x555557f4bcf0, offset=0, bytes=1048576, flags=6) at ../block/io.c:1901
> #6  0x00005555560c72f9 in bdrv_aligned_pwritev (child=0x555557f51460, req=0x7fffed5ff800, offset=0, bytes=1048576, align=1, qiov=0x0, qiov_offset=0, flags=6) at ../block/io.c:2100
> #7  0x00005555560c6b41 in bdrv_co_do_zero_pwritev (child=0x555557f51460, offset=0, bytes=1048576, flags=6, req=0x7fffed5ff800) at ../block/io.c:2183
> #8  0x00005555560c6647 in bdrv_co_pwritev_part (child=0x555557f51460, offset=0, bytes=1048576, qiov=0x0, qiov_offset=0, flags=6) at ../block/io.c:2283
> #9  0x00005555560c634f in bdrv_co_pwritev (child=0x555557f51460, offset=0, bytes=1048576, qiov=0x0, flags=6) at ../block/io.c:2216
> #10 0x00005555560c75b5 in bdrv_co_pwrite_zeroes (child=0x555557f51460, offset=0, bytes=1048576, flags=BDRV_REQ_MAY_UNMAP) at ../block/io.c:2322
> #11 0x0000555556117d24 in raw_co_pwrite_zeroes (bs=0x555557f44980, offset=0, bytes=1048576, flags=BDRV_REQ_MAY_UNMAP) at ../block/raw-format.c:307
> #12 0x00005555560cde2a in bdrv_co_do_pwrite_zeroes (bs=0x555557f44980, offset=0, bytes=1048576, flags=6) at ../block/io.c:1901
> #13 0x00005555560c72f9 in bdrv_aligned_pwritev (child=0x555557f513f0, req=0x7fffed5ffd90, offset=0, bytes=1048576, align=1, qiov=0x0, qiov_offset=0, flags=6) at ../block/io.c:2100
> #14 0x00005555560c6b41 in bdrv_co_do_zero_pwritev (child=0x555557f513f0, offset=0, bytes=1048576, flags=6, req=0x7fffed5ffd90) at ../block/io.c:2183
> #15 0x00005555560c6647 in bdrv_co_pwritev_part (child=0x555557f513f0, offset=0, bytes=1048576, qiov=0x0, qiov_offset=0, flags=6) at ../block/io.c:2283
> #16 0x00005555560ad741 in blk_co_do_pwritev_part (blk=0x555557f51660, offset=0, bytes=1048576, qiov=0x0, qiov_offset=0, flags=6) at ../block/block-backend.c:1425
> #17 0x00005555560ad5f2 in blk_co_pwritev_part (blk=0x555557f51660, offset=0, bytes=1048576, qiov=0x0, qiov_offset=0, flags=6) at ../block/block-backend.c:1440
> #18 0x00005555560ad8cf in blk_co_pwritev (blk=0x555557f51660, offset=0, bytes=1048576, qiov=0x0, flags=6) at ../block/block-backend.c:1462
> #19 0x00005555560b0f79 in blk_co_pwrite_zeroes (blk=0x555557f51660, offset=0, bytes=1048576, flags=6) at ../block/block-backend.c:2590
> #20 0x000055555606d240 in blk_co_pwrite_zeroes_entry (opaque=0x7fffffffbc18) at block/block-gen.c:2162
> #21 0x00005555562c36ba in coroutine_trampoline (i0=1475685216, i1=21845) at ../util/coroutine-ucontext.c:175
> #22 0x00007ffff4ef1190 in ??? () at /lib64/libc.so <http://libc.so/>.6
> #23 0x00007fffffffc010 in ??? ()
> #24 0x0000000000000000 in ??? ()
> 
> I haven't checked the details yet, but my first impression is that this
> check should probably move to bdrv_co_do_pwrite_zeroes().

Punching holes only if BDRV_REQ_MAY_UNMAP bit is set sounds like the right way.

> 
> Kevin
> 
> 
> [1] Tests I did:
> 
> # For discard
> 
> $ echo -e 'qemu-io none0 "discard 0 1M"\nquit' | ./qemu-system-x86_64 -drive if=none,file=/tmp/test.raw,format=raw -monitor stdio
> QEMU 9.0.50 monitor - type 'help' for more information
> (qemu) qemu-io none0 "discard 0 1M"
> discard 1048576/1048576 bytes at offset 0
> 1 MiB, 1 ops; 00.00 sec (396.171 GiB/sec and 405679.5132 ops/sec)
> (qemu) quit
> 
> $ echo -e 'qemu-io none0 "discard 0 1M"\nquit' | ./qemu-system-x86_64 -drive if=none,file=/tmp/test.raw,format=raw,discard=unmap -monitor stdio
> QEMU 9.0.50 monitor - type 'help' for more information
> (qemu) qemu-io none0 "discard 0 1M"
> handle_aiocb_discard
> discard 1048576/1048576 bytes at offset 0
> 1 MiB, 1 ops; 00.00 sec (6.623 GiB/sec and 6782.2820 ops/sec)
> (qemu) quit
> 
> $ echo -e 'qemu-io none0 "discard 0 1M"\nquit' | ./qemu-system-x86_64 -drive if=none,file=/tmp/test.raw,format=raw,discard=off -monitor stdio
> QEMU 9.0.50 monitor - type 'help' for more information
> (qemu) qemu-io none0 "discard 0 1M"
> discard 1048576/1048576 bytes at offset 0
> 1 MiB, 1 ops; 00.00 sec (375.168 GiB/sec and 384172.1091 ops/sec)
> (qemu) quit
> 
> # For write_zeroes
> 
> $ echo -e 'qemu-io none0 "write -zu 0 1M"\nquit' | ./qemu-system-x86_64 -drive if=none,file=/tmp/test.raw,format=raw,discard=unmap -monitor stdio
> QEMU 9.0.50 monitor - type 'help' for more information
> (qemu) qemu-io none0 "write -zu 0 1M"
> handle_aiocb_write_zeroes_unmap
> wrote 1048576/1048576 bytes at offset 0
> 1 MiB, 1 ops; 00.00 sec (4.943 GiB/sec and 5061.5997 ops/sec)
> (qemu) quit
> $ echo -e 'qemu-io none0 "write -zu 0 1M"\nquit' | ./qemu-system-x86_64 -drive if=none,file=/tmp/test.raw,format=raw,discard=off -monitor stdio
> QEMU 9.0.50 monitor - type 'help' for more information
> (qemu) qemu-io none0 "write -zu 0 1M"
> handle_aiocb_write_zeroes_unmap
> wrote 1048576/1048576 bytes at offset 0
> 1 MiB, 1 ops; 00.00 sec (7.208 GiB/sec and 7381.4357 ops/sec)
> (qemu) quit

Looks like a good way to write tests for this, thanks for sharing.

Nir
diff mbox series

Patch

diff --git a/block/file-posix.c b/block/file-posix.c
index be25e35ff6..acac2abadc 100644
--- a/block/file-posix.c
+++ b/block/file-posix.c
@@ -738,11 +738,11 @@  static int raw_open_common(BlockDriverState *bs, QDict *options,
         ret = -EINVAL;
         goto fail;
     }
 #endif /* !defined(CONFIG_LINUX_IO_URING) */
 
-    s->has_discard = true;
+    s->has_discard = !!(bdrv_flags & BDRV_O_UNMAP);
     s->has_write_zeroes = true;
 
     if (fstat(s->fd, &st) < 0) {
         ret = -errno;
         error_setg_errno(errp, errno, "Could not stat file");