
block: file-posix: Fail unmap with NO_FALLBACK on block device

Message ID 20200613170826.354270-1-nsoffer@redhat.com (mailing list archive)
State New, archived
Series block: file-posix: Fail unmap with NO_FALLBACK on block device

Commit Message

Nir Soffer June 13, 2020, 5:08 p.m. UTC
Punching holes on a block device uses blkdev_issue_zeroout() with
BLKDEV_ZERO_NOFALLBACK, but there is no guarantee that this is fast
enough for pre-zeroing an entire device.

Zeroing a block device can be as slow as writing zeroes, or 100 times
faster, depending on the storage. There is no way to tell in advance
whether zeroing is fast enough. The kernel BLKDEV_ZERO_NOFALLBACK flag
does not mean that the operation is fast; it just means that the kernel
will not fall back to manual zeroing.
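
Since there is no interface that reports whether a device can zero
quickly, the only way to know is to measure. Below is a minimal
standalone sketch (not part of this patch; the device path is just a
placeholder for a scratch LV) that issues the same fallocate(2) request
file-posix uses and times it across the whole device:

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fs.h>           /* BLKGETSIZE64 */

int main(int argc, char **argv)
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s /dev/vg/scratch-lv\n", argv[0]);
        return 1;
    }

    int fd = open(argv[1], O_RDWR);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    uint64_t size;
    if (ioctl(fd, BLKGETSIZE64, &size) < 0) {
        perror("BLKGETSIZE64");
        return 1;
    }

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);

    /* Same request handle_aiocb_write_zeroes_unmap() makes; on a block
     * device the kernel turns this into blkdev_issue_zeroout() with
     * BLKDEV_ZERO_NOFALLBACK. */
    if (fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                  0, size) < 0) {
        perror("fallocate");
        return 1;
    }
    fsync(fd);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs = (t1.tv_sec - t0.tv_sec) +
                  (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("zeroed %llu bytes in %.1f s (%.0f MiB/s)\n",
           (unsigned long long)size, secs, size / secs / (1 << 20));
    close(fd);
    return 0;
}

The reported rate is what decides whether a pre-zero pass pays off
compared with just writing zeroes during the copy.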

Here is an example of converting a 10 GiB image with 8 GiB of data to a
block device:

$ ./qemu-img info test.img
image: test.img
file format: raw
virtual size: 10 GiB (10737418240 bytes)
disk size: 8 GiB

$ time ./qemu-img convert -f raw -O raw -t none -T none -W test.img /dev/test/lv1

Before:

real    1m20.483s
user    0m0.490s
sys     0m0.739s

After:

real    0m55.831s
user    0m0.610s
sys     0m0.956s

Signed-off-by: Nir Soffer <nsoffer@redhat.com>
---
 block/file-posix.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

Comments

Nir Soffer June 15, 2020, 7:32 p.m. UTC | #1
On Sat, Jun 13, 2020 at 8:08 PM Nir Soffer <nirsof@gmail.com> wrote:
>
> Punching holes on a block device uses blkdev_issue_zeroout() with
> BLKDEV_ZERO_NOFALLBACK, but there is no guarantee that this is fast
> enough for pre-zeroing an entire device.
>
> Zeroing a block device can be as slow as writing zeroes, or 100 times
> faster, depending on the storage. There is no way to tell in advance
> whether zeroing is fast enough. The kernel BLKDEV_ZERO_NOFALLBACK flag
> does not mean that the operation is fast; it just means that the kernel
> will not fall back to manual zeroing.
>
> Here is an example of converting a 10 GiB image with 8 GiB of data to
> a block device:
>
> $ ./qemu-img info test.img
> image: test.img
> file format: raw
> virtual size: 10 GiB (10737418240 bytes)
> disk size: 8 GiB
>
> $ time ./qemu-img convert -f raw -O raw -t none -T none -W test.img /dev/test/lv1
>
> Before:
>
> real    1m20.483s
> user    0m0.490s
> sys     0m0.739s
>
> After:
>
> real    0m55.831s
> user    0m0.610s
> sys     0m0.956s

I did more testing with a real server and storage, and the results
confirm what I reported here from my VM-based environment with poor
storage.


Testing this LUN:

# multipath -ll
3600a098038304437415d4b6a59684a52 dm-3 NETAPP,LUN C-Mode
size=5.0T features='3 queue_if_no_path pg_init_retries 50'
hwhandler='1 alua' wp=rw
|-+- policy='service-time 0' prio=50 status=active
| |- 18:0:0:0 sdb     8:16  active ready running
| `- 19:0:0:0 sdc     8:32  active ready running
`-+- policy='service-time 0' prio=10 status=enabled
  |- 20:0:0:0 sdd     8:48  active ready running
  `- 21:0:0:0 sde     8:64  active ready running


The destination is a 100 GiB logical volume on this LUN:

# qemu-img info test-lv
image: test-lv
file format: raw
virtual size: 100 GiB (107374182400 bytes)
disk size: 0 B


The source image is a 100 GiB image with 48 GiB of data:

# qemu-img info fedora-31-100g-50p.raw
image: fedora-31-100g-50p.raw
file format: raw
virtual size: 100 GiB (107374182400 bytes)
disk size: 48.4 GiB


We can zero 2.3 g/s:

# time blkdiscard -z test-lv

real 0m43.902s
user 0m0.002s
sys 0m0.130s

(I should really test with fallocate instead of blkdiscard, but the results look
the same.)

# iostat -xdm dm-3 5

Device            r/s     w/s     rMB/s     wMB/s   rrqm/s   wrqm/s  %rrqm  %wrqm r_await w_await aqu-sz rareq-sz wareq-sz  svctm  %util

dm-3            20.80  301.40      0.98   2323.31     0.00     0.00   0.00   0.00   26.56  854.50 257.94    48.23  7893.41   0.73  23.58

dm-3            15.20  297.20      0.80   2321.67     0.00     0.00   0.00   0.00   26.43  836.06 248.72    53.80  7999.30   0.78  24.22


We can write 445m/s:

# dd if=/dev/zero bs=2M count=51200 of=test-lv oflag=direct conv=fsync
107374182400 bytes (107 GB, 100 GiB) copied, 241.257 s, 445 MB/s

# iostat -xdm dm-3 5

Device            r/s     w/s     rMB/s     wMB/s   rrqm/s   wrqm/s  %rrqm  %wrqm r_await w_await aqu-sz rareq-sz wareq-sz  svctm  %util

dm-3             6.60 6910.00      0.39    431.85     0.00     0.00   0.00   0.00    2.48    2.70  15.19    60.73    64.00   0.14  98.84

dm-3            40.80 6682.60      1.59    417.61     0.00     0.00   0.00   0.00    1.71    2.73  14.92    40.00    63.99   0.15  97.60

dm-3             6.60 6887.40      0.39    430.46     0.00     0.00   0.00   0.00    2.15    2.66  14.92    60.73    64.00   0.14  98.22


Testing latest qemu-img:

# rpm -q qemu-img
qemu-img-4.2.0-22.module+el8.2.1+6758+cb8d64c2.x86_64

# time qemu-img convert -p -f raw -O raw -t none -W fedora-31-100g-50p.raw test-lv
    (100.00/100%)

real 2m2.337s
user 0m2.708s
sys 0m17.326s

# iostat -xdm dm-3 5

pre zero phase:

Device            r/s     w/s     rMB/s     wMB/s   rrqm/s   wrqm/s  %rrqm  %wrqm r_await w_await aqu-sz rareq-sz wareq-sz  svctm  %util

dm-3            24.00  265.40      1.00   2123.20     0.00     0.00   0.00   0.00   36.81  543.52 144.99    42.48  8192.00   0.70  20.14

dm-3             9.60  283.60      0.59   2265.60     0.00     0.00   0.00   0.00   35.42  576.80 163.78    62.50  8180.44   0.70  20.58

dm-3            24.00  272.00      1.00   2176.00     0.00     0.00   0.00   0.00   22.89  512.40 139.77    42.48  8192.00   0.67  19.90

copy phase:

Device            r/s     w/s     rMB/s     wMB/s   rrqm/s   wrqm/s  %rrqm  %wrqm r_await w_await aqu-sz rareq-sz wareq-sz  svctm  %util

dm-3            27.20 10671.20      1.19    655.84     0.00     0.00   0.00   0.00    2.70   10.99 111.98    44.83    62.93   0.09  96.74

dm-3             6.40 11537.00      0.39    712.33     0.00     0.00   0.00   0.00    3.00   11.90 131.52    62.50    63.23   0.08  97.82

dm-3            27.20 12400.20      1.19    765.47     0.00     0.00   0.00   0.00    3.60   11.16 132.31    44.83    63.21   0.08  95.50

dm-3             9.60 11312.60      0.59    698.20     0.00     0.20   0.00   0.00    3.73   11.69 126.64    63.00    63.20   0.09  97.70


Testing latest qemu-img + this patch:

# rpm -q qemu-img
qemu-img-4.2.0-25.module+el8.2.1+6815+1c792dc8.nsoffer202006140516.x86_64

# time qemu-img convert -p -f raw -O raw -t none -W fedora-31-100g-50p.raw test-lv
    (100.00/100%)

real 1m42.083s
user 0m3.007s
sys 0m18.735s

# iostat -xdm dm-3 5

Device            r/s     w/s     rMB/s     wMB/s   rrqm/s   wrqm/s  %rrqm  %wrqm r_await w_await aqu-sz rareq-sz wareq-sz  svctm  %util

dm-3             6.60 7919.60      0.39   1136.67     0.00     0.00   0.00   0.00   14.70   15.32 117.43    60.73   146.97   0.10  77.84

dm-3            27.00 9065.00      1.19    571.38     0.00     0.20   0.00   0.00    2.52   14.64 128.21    45.13    64.54   0.11  97.46

dm-3             6.80 9467.40      0.40    814.75     0.00     0.00   0.00   0.00    2.74   12.15 110.25    60.82    88.12   0.10  90.46

dm-3            29.00 7713.20      1.32    996.48     0.00     0.40   0.00   0.01    5.40   14.48 107.98    46.60   132.29   0.11  83.76

dm-3            11.60 9661.60      0.70    703.54     0.00     0.40   0.00   0.00    2.26   11.22 103.56    61.72    74.57   0.10  97.98

dm-3            23.80 9639.20      0.99    696.82     0.00     0.00   0.00   0.00    1.98   11.54 106.49    42.80    74.03   0.10  93.68

dm-3            10.00 7184.60      0.60   1147.56     0.00     0.00   0.00   0.00   12.84   15.32 106.58    61.36   163.56   0.09  68.30

dm-3            35.00 6771.40      1.69   1293.37     0.00     0.00   0.00   0.00   17.44   18.06 119.48    49.58   195.59   0.10  66.46




>
> Signed-off-by: Nir Soffer <nsoffer@redhat.com>
> ---
>  block/file-posix.c | 10 ++++++++++
>  1 file changed, 10 insertions(+)
>
> diff --git a/block/file-posix.c b/block/file-posix.c
> index 3ab8f5a0fa..cd2e409184 100644
> --- a/block/file-posix.c
> +++ b/block/file-posix.c
> @@ -1621,6 +1621,16 @@ static int handle_aiocb_write_zeroes_unmap(void *opaque)
>      /* First try to write zeros and unmap at the same time */
>
>  #ifdef CONFIG_FALLOCATE_PUNCH_HOLE
> +    /*
> +     * The block device fallocate() implementation in the kernel does set
> +     * BLKDEV_ZERO_NOFALLBACK, but it does not guarantee that the operation is
> +     * fast so we can't call this if we have to avoid slow fallbacks.
> +     */
> +    if (aiocb->aio_type & QEMU_AIO_BLKDEV &&
> +        aiocb->aio_type & QEMU_AIO_NO_FALLBACK) {
> +        return -ENOTSUP;
> +    }
> +
>      int ret = do_fallocate(s->fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
>                             aiocb->aio_offset, aiocb->aio_nbytes);
>      if (ret != -ENOTSUP) {
> --
> 2.25.4
>
Kevin Wolf June 16, 2020, 3:32 p.m. UTC | #2
On 15.06.2020 at 21:32, Nir Soffer wrote:
> We can zero 2.3 g/s:
> 
> # time blkdiscard -z test-lv
> 
> real 0m43.902s
> user 0m0.002s
> sys 0m0.130s

> We can write 445m/s:
> 
> # dd if=/dev/zero bs=2M count=51200 of=test-lv oflag=direct conv=fsync
> 107374182400 bytes (107 GB, 100 GiB) copied, 241.257 s, 445 MB/s

So using FALLOC_FL_PUNCH_HOLE _is_ faster after all. What might not be
faster is zeroing out the whole device and then overwriting a
considerable part of it again.

I think this means that we shouldn't fail write_zeroes at the file-posix
level even if BDRV_REQ_NO_FALLBACK is given. Instead, qemu-img convert
is where I see a fix.

Certainly qemu-img could be cleverer and zero out more selectively. The
idea of doing a blk_make_zero() first seems to have caused some
problems, though of course its introduction was also justified with
performance, so improving one case might hurt another if we're not
careful.

However, when Peter Lieven introduced this (commit 5a37b60a61c), we
didn't use write_zeroes yet during the regular copy loop (we do since
commit 690c7301600). So chances are that blk_make_zero() doesn't
actually help any more now.

Can you run another test with the patch below? I think it should perform
the same as yours. Eric, Peter, do you think this would have a negative
effect for NBD and/or iscsi?

The other option would be providing an option and making it Someone
Else's Problem.

Kevin


diff --git a/qemu-img.c b/qemu-img.c
index d7e846e607..bdb9f6aa46 100644
--- a/qemu-img.c
+++ b/qemu-img.c
@@ -2084,15 +2084,6 @@ static int convert_do_copy(ImgConvertState *s)
         s->has_zero_init = bdrv_has_zero_init(blk_bs(s->target));
     }

-    if (!s->has_zero_init && !s->target_has_backing &&
-        bdrv_can_write_zeroes_with_unmap(blk_bs(s->target)))
-    {
-        ret = blk_make_zero(s->target, BDRV_REQ_MAY_UNMAP | BDRV_REQ_NO_FALLBACK);
-        if (ret == 0) {
-            s->has_zero_init = true;
-        }
-    }
-
     /* Allocate buffer for copied data. For compressed images, only one cluster
      * can be copied at a time. */
     if (s->compressed) {
Nir Soffer June 16, 2020, 5:39 p.m. UTC | #3
On Tue, Jun 16, 2020 at 6:32 PM Kevin Wolf <kwolf@redhat.com> wrote:
>
> On 15.06.2020 at 21:32, Nir Soffer wrote:
> > We can zero 2.3 g/s:
> >
> > # time blkdiscard -z test-lv
> >
> > real 0m43.902s
> > user 0m0.002s
> > sys 0m0.130s
>
> > We can write 445m/s:
> >
> > # dd if=/dev/zero bs=2M count=51200 of=test-lv oflag=direct conv=fsync
> > 107374182400 bytes (107 GB, 100 GiB) copied, 241.257 s, 445 MB/s
>
> So using FALLOC_FL_PUNCH_HOLE _is_ faster after all. What might not be
> faster is zeroing out the whole device and then overwriting a
> considerable part of it again.
>
> I think this means that we shouldn't fail write_zeroes at the file-posix
> level even if BDRV_REQ_NO_FALLBACK is given. Instead, qemu-img convert
> is where I see a fix.
>
> Certainly qemu-img could be cleverer and zero out more selectively. The
> idea of doing a blk_make_zero() first seems to have caused some
> problems, though of course its introduction was also justified with
> performance, so improving one case might hurt another if we're not
> careful.
>
> However, when Peter Lieven introduced this (commit 5a37b60a61c), we
> didn't use write_zeroes yet during the regular copy loop (we do since
> commit 690c7301600). So chances are that blk_make_zero() doesn't
> actually help any more now.
>
> Can you run another test with the patch below?

Sure, I can try this.

> I think it should perform
> the same as yours. Eric, Peter, do you think this would have a negative
> effect for NBD and/or iscsi?
>
> The other option would be providing an option and making it Someone
> Else's Problem.
>
> Kevin
>
>
> diff --git a/qemu-img.c b/qemu-img.c
> index d7e846e607..bdb9f6aa46 100644
> --- a/qemu-img.c
> +++ b/qemu-img.c
> @@ -2084,15 +2084,6 @@ static int convert_do_copy(ImgConvertState *s)
>          s->has_zero_init = bdrv_has_zero_init(blk_bs(s->target));
>      }
>
> -    if (!s->has_zero_init && !s->target_has_backing &&
> -        bdrv_can_write_zeroes_with_unmap(blk_bs(s->target)))
> -    {
> -        ret = blk_make_zero(s->target, BDRV_REQ_MAY_UNMAP | BDRV_REQ_NO_FALLBACK);
> -        if (ret == 0) {
> -            s->has_zero_init = true;
> -        }
> -    }

This will work of course, but now we will not do bulk zeroing for any
target.

I think we never do this for regular files anyway, since we truncate
files and there is nothing to zero, but maybe there is some case where
this is useful?

BTW, do we use BDRV_REQ_NO_FALLBACK elsewhere? Maybe we can remove it.

>      /* Allocate buffer for copied data. For compressed images, only one cluster
>       * can be copied at a time. */
>      if (s->compressed) {
>
Nir Soffer June 16, 2020, 8:01 p.m. UTC | #4
On Tue, Jun 16, 2020 at 8:39 PM Nir Soffer <nsoffer@redhat.com> wrote:
>
> On Tue, Jun 16, 2020 at 6:32 PM Kevin Wolf <kwolf@redhat.com> wrote:
> >
> > On 15.06.2020 at 21:32, Nir Soffer wrote:
> > > We can zero 2.3 g/s:
> > >
> > > # time blkdiscard -z test-lv
> > >
> > > real 0m43.902s
> > > user 0m0.002s
> > > sys 0m0.130s
> >
> > > We can write 445m/s:
> > >
> > > # dd if=/dev/zero bs=2M count=51200 of=test-lv oflag=direct conv=fsync
> > > 107374182400 bytes (107 GB, 100 GiB) copied, 241.257 s, 445 MB/s
> >
> > So using FALLOC_FL_PUNCH_HOLE _is_ faster after all. What might not be
> > faster is zeroing out the whole device and then overwriting a
> > considerable part of it again.
> >
> > I think this means that we shouldn't fail write_zeroes at the file-posix
> > level even if BDRV_REQ_NO_FALLBACK is given. Instead, qemu-img convert
> > is where I see a fix.
> >
> > Certainly qemu-img could be cleverer and zero out more selectively. The
> > idea of doing a blk_make_zero() first seems to have caused some
> > problems, though of course its introduction was also justified with
> > performance, so improving one case might hurt another if we're not
> > careful.
> >
> > However, when Peter Lieven introduced this (commit 5a37b60a61c), we
> > didn't use write_zeroes yet during the regular copy loop (we do since
> > commit 690c7301600). So chances are that blk_make_zero() doesn't
> > actually help any more now.
> >
> > Can you run another test with the patch below?
>
> Sure, I can try this.

Tried it, and it performs the same as expected.

> > I think it should perform
> > the same as yours. Eric, Peter, do you think this would have a negative
> > effect for NBD and/or iscsi?
> >
> > The other option would be providing an option and making it Someone
> > Else's Problem.
> >
> > Kevin
> >
> >
> > diff --git a/qemu-img.c b/qemu-img.c
> > index d7e846e607..bdb9f6aa46 100644
> > --- a/qemu-img.c
> > +++ b/qemu-img.c
> > @@ -2084,15 +2084,6 @@ static int convert_do_copy(ImgConvertState *s)
> >          s->has_zero_init = bdrv_has_zero_init(blk_bs(s->target));
> >      }
> >
> > -    if (!s->has_zero_init && !s->target_has_backing &&
> > -        bdrv_can_write_zeroes_with_unmap(blk_bs(s->target)))
> > -    {
> > -        ret = blk_make_zero(s->target, BDRV_REQ_MAY_UNMAP | BDRV_REQ_NO_FALLBACK);
> > -        if (ret == 0) {
> > -            s->has_zero_init = true;
> > -        }
> > -    }
>
> This will work of course, but now we will not do bulk zero for any target

I would like to have a minimal change to increase the chance that we can
consume this very soon in oVirt.

> I think we never do this for regular files anyway since we truncate
> files, and there
> is nothing to zero, but maybe there is some case when this is useful?
>
> BTW, do we use BDRV_REQ_NO_FALLBACK elsewhere? maybe we can remove it.
>
> >      /* Allocate buffer for copied data. For compressed images, only one cluster
> >       * can be copied at a time. */
> >      if (s->compressed) {
> >
Eric Blake June 16, 2020, 8:09 p.m. UTC | #5
On 6/16/20 10:32 AM, Kevin Wolf wrote:
> On 15.06.2020 at 21:32, Nir Soffer wrote:
>> We can zero 2.3 g/s:
>>
>> # time blkdiscard -z test-lv
>>
>> real 0m43.902s
>> user 0m0.002s
>> sys 0m0.130s
> 
>> We can write 445m/s:
>>
>> # dd if=/dev/zero bs=2M count=51200 of=test-lv oflag=direct conv=fsync
>> 107374182400 bytes (107 GB, 100 GiB) copied, 241.257 s, 445 MB/s
> 
> So using FALLOC_FL_PUNCH_HOLE _is_ faster after all. What might not be
> faster is zeroing out the whole device and then overwriting a
> considerable part of it again.

Yeah, there can indeed be a difference between a pre-zeroing pass that
can be super-fast (on a posix file, truncate to 0 and back to the
desired size, for example) and one that is faster than writes but still
slower than a single pass.
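
For the regular-file case mentioned above, the super-fast pre-zeroing
amounts to recreating the file's allocation. A minimal sketch, not qemu
code, just to illustrate the trick:

#include <sys/types.h>
#include <sys/stat.h>
#include <unistd.h>

/* "Pre-zero" a regular file by truncating it to 0 and back to its old
 * size; the file then reads back as zeroes without writing any data.
 * This only works for regular files, not for block devices. */
static int make_zero_by_truncate(int fd)
{
    struct stat st;

    if (fstat(fd, &st) < 0) {
        return -1;
    }
    if (ftruncate(fd, 0) < 0) {
        return -1;
    }
    return ftruncate(fd, st.st_size);
}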

> 
> I think this means that we shouldn't fail write_zeroes at the file-posix
> level even if BDRV_REQ_NO_FALLBACK is given. Instead, qemu-img convert
> is where I see a fix.

Is the kernel able to tell us reliably when we can perform a fast 
pre-zero pass?  If it can't, it's that much harder to define when 
BDRV_REQ_NO_FALLBACK makes a difference.

> 
> Certainly qemu-img could be cleverer and zero out more selectively. The
> idea of doing a blk_make_zero() first seems to have caused some
> problems, though of course its introduction was also justified with
> performance, so improving one case might hurt another if we're not
> careful.
> 
> However, when Peter Lieven introduced this (commit 5a37b60a61c), we
> didn't use write_zeroes yet during the regular copy loop (we do since
> commit 690c7301600). So chances are that blk_make_zero() doesn't
> actually help any more now.
> 
> Can you run another test with the patch below? I think it should perform
> the same as yours. Eric, Peter, do you think this would have a negative
> effect for NBD and/or iscsi?

I'm still hoping to revive my work on making bdrv_make_zero a per-driver 
callback with smarts for the fastest possible pre-zeroing that driver is 
capable of, or fast failure when BDRV_REQ_NO_FALLBACK is set and it is 
no faster to pre-zero than it is to just write zeroes when needed.  I 
can certainly construct NBD scenarios in either direction (where a 
pre-zeroing pass is faster because of less network traffic, or where a 
pre-zeroing pass is slower because of increased I/O - in fact, that was 
part of my KVM Forum 2019 demo on why the NBD protocol added a FAST_ZERO 
flag mirroring the idea of qemu's BDRV_REQ_NO_FALLBACK).
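
Roughly the shape this could take; the callback name and wiring below
are made up for illustration and assume qemu's block-layer types, they
are not existing API. A driver either zeroes the whole device in its
fastest way or fails fast when BDRV_REQ_NO_FALLBACK is set and it has
nothing faster than plain writes, so qemu-img can skip the pre-zero
pass and zero selectively in the copy loop instead:

/* Hypothetical per-driver hook, declared in BlockDriver: */
int coroutine_fn (*bdrv_co_make_zero)(BlockDriverState *bs,
                                      BdrvRequestFlags flags);

/* Hypothetical generic wrapper: */
static int coroutine_fn bdrv_co_make_zero(BlockDriverState *bs,
                                          BdrvRequestFlags flags)
{
    if (bs->drv && bs->drv->bdrv_co_make_zero) {
        /* Driver knows its fastest way to zero the whole device, or
         * returns -ENOTSUP quickly if it has none. */
        return bs->drv->bdrv_co_make_zero(bs, flags);
    }
    /*
     * No driver fast path.  With BDRV_REQ_NO_FALLBACK the caller wants
     * fast failure rather than a slow generic zeroing pass; without it,
     * the existing write_zeroes-based loop could run here instead.
     */
    return -ENOTSUP;
}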

> 
> The other option would be providing an option and making it Someone
> Else's Problem.

Matching what we recently did with --target-is-zero.

> 
> Kevin
> 
> 
> diff --git a/qemu-img.c b/qemu-img.c
> index d7e846e607..bdb9f6aa46 100644
> --- a/qemu-img.c
> +++ b/qemu-img.c
> @@ -2084,15 +2084,6 @@ static int convert_do_copy(ImgConvertState *s)
>           s->has_zero_init = bdrv_has_zero_init(blk_bs(s->target));
>       }
> 
> -    if (!s->has_zero_init && !s->target_has_backing &&
> -        bdrv_can_write_zeroes_with_unmap(blk_bs(s->target)))
> -    {
> -        ret = blk_make_zero(s->target, BDRV_REQ_MAY_UNMAP | BDRV_REQ_NO_FALLBACK);
> -        if (ret == 0) {
> -            s->has_zero_init = true;
> -        }
> -    }
> -
>       /* Allocate buffer for copied data. For compressed images, only one cluster
>        * can be copied at a time. */
>       if (s->compressed) {
>
Kevin Wolf June 17, 2020, 10:47 a.m. UTC | #6
On 16.06.2020 at 22:01, Nir Soffer wrote:
> On Tue, Jun 16, 2020 at 8:39 PM Nir Soffer <nsoffer@redhat.com> wrote:
> >
> > On Tue, Jun 16, 2020 at 6:32 PM Kevin Wolf <kwolf@redhat.com> wrote:
> > >
> > > On 15.06.2020 at 21:32, Nir Soffer wrote:
> > > > We can zero 2.3 g/s:
> > > >
> > > > # time blkdiscard -z test-lv
> > > >
> > > > real 0m43.902s
> > > > user 0m0.002s
> > > > sys 0m0.130s
> > >
> > > > We can write 445m/s:
> > > >
> > > > # dd if=/dev/zero bs=2M count=51200 of=test-lv oflag=direct conv=fsync
> > > > 107374182400 bytes (107 GB, 100 GiB) copied, 241.257 s, 445 MB/s
> > >
> > > So using FALLOC_FL_PUNCH_HOLE _is_ faster after all. What might not be
> > > faster is zeroing out the whole device and then overwriting a
> > > considerable part of it again.
> > >
> > > I think this means that we shouldn't fail write_zeroes at the file-posix
> > > level even if BDRV_REQ_NO_FALLBACK is given. Instead, qemu-img convert
> > > is where I see a fix.
> > >
> > > Certainly qemu-img could be cleverer and zero out more selectively. The
> > > idea of doing a blk_make_zero() first seems to have caused some
> > > problems, though of course its introduction was also justified with
> > > performance, so improving one case might hurt another if we're not
> > > careful.
> > >
> > > However, when Peter Lieven introduced this (commit 5a37b60a61c), we
> > > didn't use write_zeroes yet during the regular copy loop (we do since
> > > commit 690c7301600). So chances are that blk_make_zero() doesn't
> > > actually help any more now.
> > >
> > > Can you run another test with the patch below?
> >
> > Sure, I can try this.
> 
> Tried it, and it performs the same as expected.

Thanks.

> > > I think it should perform
> > > the same as yours. Eric, Peter, do you think this would have a negative
> > > effect for NBD and/or iscsi?
> > >
> > > The other option would be providing an option and making it Someone
> > > Else's Problem.
> > >
> > > Kevin
> > >
> > >
> > > diff --git a/qemu-img.c b/qemu-img.c
> > > index d7e846e607..bdb9f6aa46 100644
> > > --- a/qemu-img.c
> > > +++ b/qemu-img.c
> > > @@ -2084,15 +2084,6 @@ static int convert_do_copy(ImgConvertState *s)
> > >          s->has_zero_init = bdrv_has_zero_init(blk_bs(s->target));
> > >      }
> > >
> > > -    if (!s->has_zero_init && !s->target_has_backing &&
> > > -        bdrv_can_write_zeroes_with_unmap(blk_bs(s->target)))
> > > -    {
> > > -        ret = blk_make_zero(s->target, BDRV_REQ_MAY_UNMAP | BDRV_REQ_NO_FALLBACK);
> > > -        if (ret == 0) {
> > > -            s->has_zero_init = true;
> > > -        }
> > > -    }
> >
> > This will work of course, but now we will not do bulk zero for any target
> 
> I would like to have a minimal change to increase the chance that we
> can consume this very soon in oVirt.

I think this one would be pretty minimal.

Maybe we can later bring this code back, but with an implementation of
blk_make_zero() that doesn't use the generic write_zeroes operation and
instead uses a specific callback like Eric suggested.

> > I think we never do this for regular files anyway since we truncate
> > files, and there is nothing to zero, but maybe there is some case
> > when this is useful?

Yes, regular files have s->has_zero_init == true anyway.

> > BTW, do we use BDRV_REQ_NO_FALLBACK elsewhere? maybe we can remove
> > it.

qcow2 uses it when zeroing out parts of a newly allocated cluster on
partial writes. Though that code is questionable, too, and XFS people
suggest that we should stop using it.

Kevin

Patch

diff --git a/block/file-posix.c b/block/file-posix.c
index 3ab8f5a0fa..cd2e409184 100644
--- a/block/file-posix.c
+++ b/block/file-posix.c
@@ -1621,6 +1621,16 @@  static int handle_aiocb_write_zeroes_unmap(void *opaque)
     /* First try to write zeros and unmap at the same time */
 
 #ifdef CONFIG_FALLOCATE_PUNCH_HOLE
+    /*
+     * The block device fallocate() implementation in the kernel does set
+     * BLKDEV_ZERO_NOFALLBACK, but it does not guarantee that the operation is
+     * fast so we can't call this if we have to avoid slow fallbacks.
+     */
+    if (aiocb->aio_type & QEMU_AIO_BLKDEV &&
+        aiocb->aio_type & QEMU_AIO_NO_FALLBACK) {
+        return -ENOTSUP;
+    }
+
     int ret = do_fallocate(s->fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                            aiocb->aio_offset, aiocb->aio_nbytes);
     if (ret != -ENOTSUP) {