Message ID | 20200613170826.354270-1-nsoffer@redhat.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | block: file-posix: Fail unmap with NO_FALLBACK on block device | expand |
On Sat, Jun 13, 2020 at 8:08 PM Nir Soffer <nirsof@gmail.com> wrote:
>
> Punching holes on a block device uses blkdev_issue_zeroout() with
> BLKDEV_ZERO_NOFALLBACK, but there is no guarantee that this is fast
> enough for pre-zeroing an entire device.
>
> Zeroing a block device can be as slow as writing zeroes or 100 times
> faster, depending on the storage. There is no way to tell if zeroing is
> fast enough. The kernel BLKDEV_ZERO_NOFALLBACK flag does not mean that
> the operation is fast; it just means that the kernel will not fall back
> to manual zeroing.
>
> Here is an example converting a 10g image with 8g of data to a block
> device:
>
> $ ./qemu-img info test.img
> image: test.img
> file format: raw
> virtual size: 10 GiB (10737418240 bytes)
> disk size: 8 GiB
>
> $ time ./qemu-img convert -f raw -O raw -t none -T none -W test.img /dev/test/lv1
>
> Before:
>
> real 1m20.483s
> user 0m0.490s
> sys 0m0.739s
>
> After:
>
> real 0m55.831s
> user 0m0.610s
> sys 0m0.956s

I did more testing with a real server and storage, and the results
confirm what I reported here in my VM-based environment with poor
storage.
Testing this LUN:

# multipath -ll
3600a098038304437415d4b6a59684a52 dm-3 NETAPP,LUN C-Mode
size=5.0T features='3 queue_if_no_path pg_init_retries 50' hwhandler='1 alua' wp=rw
|-+- policy='service-time 0' prio=50 status=active
| |- 18:0:0:0 sdb 8:16 active ready running
| `- 19:0:0:0 sdc 8:32 active ready running
`-+- policy='service-time 0' prio=10 status=enabled
  |- 20:0:0:0 sdd 8:48 active ready running
  `- 21:0:0:0 sde 8:64 active ready running

The destination is a 100g logical volume on this LUN:

# qemu-img info test-lv
image: test-lv
file format: raw
virtual size: 100 GiB (107374182400 bytes)
disk size: 0 B

The source image is a 100g image with 48g of data:

# qemu-img info fedora-31-100g-50p.raw
image: fedora-31-100g-50p.raw
file format: raw
virtual size: 100 GiB (107374182400 bytes)
disk size: 48.4 GiB

We can zero 2.3 g/s:

# time blkdiscard -z test-lv

real 0m43.902s
user 0m0.002s
sys 0m0.130s

(I should really test with fallocate instead of blkdiscard, but the
results look the same.)

# iostat -xdm dm-3 5

Device r/s w/s rMB/s wMB/s rrqm/s wrqm/s %rrqm %wrqm r_await w_await aqu-sz rareq-sz wareq-sz svctm %util
dm-3 20.80 301.40 0.98 2323.31 0.00 0.00 0.00 0.00 26.56 854.50 257.94 48.23 7893.41 0.73 23.58
dm-3 15.20 297.20 0.80 2321.67 0.00 0.00 0.00 0.00 26.43 836.06 248.72 53.80 7999.30 0.78 24.22

We can write 445m/s:

# dd if=/dev/zero bs=2M count=51200 of=test-lv oflag=direct conv=fsync
107374182400 bytes (107 GB, 100 GiB) copied, 241.257 s, 445 MB/s

# iostat -xdm dm-3 5

Device r/s w/s rMB/s wMB/s rrqm/s wrqm/s %rrqm %wrqm r_await w_await aqu-sz rareq-sz wareq-sz svctm %util
dm-3 6.60 6910.00 0.39 431.85 0.00 0.00 0.00 0.00 2.48 2.70 15.19 60.73 64.00 0.14 98.84
dm-3 40.80 6682.60 1.59 417.61 0.00 0.00 0.00 0.00 1.71 2.73 14.92 40.00 63.99 0.15 97.60
dm-3 6.60 6887.40 0.39 430.46 0.00 0.00 0.00 0.00 2.15 2.66 14.92 60.73 64.00 0.14 98.22

Testing latest qemu-img:

# rpm -q qemu-img
qemu-img-4.2.0-22.module+el8.2.1+6758+cb8d64c2.x86_64

# time qemu-img convert -p -f raw -O raw -t none -W fedora-31-100g-50p.raw test-lv
    (100.00/100%)

real 2m2.337s
user 0m2.708s
sys 0m17.326s

# iostat -xdm dm-3 5

pre zero phase:

Device r/s w/s rMB/s wMB/s rrqm/s wrqm/s %rrqm %wrqm r_await w_await aqu-sz rareq-sz wareq-sz svctm %util
dm-3 24.00 265.40 1.00 2123.20 0.00 0.00 0.00 0.00 36.81 543.52 144.99 42.48 8192.00 0.70 20.14
dm-3 9.60 283.60 0.59 2265.60 0.00 0.00 0.00 0.00 35.42 576.80 163.78 62.50 8180.44 0.70 20.58
dm-3 24.00 272.00 1.00 2176.00 0.00 0.00 0.00 0.00 22.89 512.40 139.77 42.48 8192.00 0.67 19.90

copy phase:

Device r/s w/s rMB/s wMB/s rrqm/s wrqm/s %rrqm %wrqm r_await w_await aqu-sz rareq-sz wareq-sz svctm %util
dm-3 27.20 10671.20 1.19 655.84 0.00 0.00 0.00 0.00 2.70 10.99 111.98 44.83 62.93 0.09 96.74
dm-3 6.40 11537.00 0.39 712.33 0.00 0.00 0.00 0.00 3.00 11.90 131.52 62.50 63.23 0.08 97.82
dm-3 27.20 12400.20 1.19 765.47 0.00 0.00 0.00 0.00 3.60 11.16 132.31 44.83 63.21 0.08 95.50
dm-3 9.60 11312.60 0.59 698.20 0.00 0.20 0.00 0.00 3.73 11.69 126.64 63.00 63.20 0.09 97.70

Testing latest qemu-img + this patch:

# rpm -q qemu-img
qemu-img-4.2.0-25.module+el8.2.1+6815+1c792dc8.nsoffer202006140516.x86_64

# time qemu-img convert -p -f raw -O raw -t none -W fedora-31-100g-50p.raw test-lv
    (100.00/100%)

real 1m42.083s
user 0m3.007s
sys 0m18.735s

# iostat -xdm dm-3 5

Device r/s w/s rMB/s wMB/s rrqm/s wrqm/s %rrqm %wrqm r_await w_await aqu-sz rareq-sz wareq-sz svctm %util
dm-3 6.60 7919.60 0.39 1136.67 0.00 0.00 0.00 0.00 14.70 15.32 117.43 60.73 146.97 0.10 77.84
dm-3 27.00 9065.00 1.19 571.38 0.00 0.20 0.00 0.00 2.52 14.64 128.21 45.13 64.54 0.11 97.46
dm-3 6.80 9467.40 0.40 814.75 0.00 0.00 0.00 0.00 2.74 12.15 110.25 60.82 88.12 0.10 90.46
dm-3 29.00 7713.20 1.32 996.48 0.00 0.40 0.00 0.01 5.40 14.48 107.98 46.60 132.29 0.11 83.76
dm-3 11.60 9661.60 0.70 703.54 0.00 0.40 0.00 0.00 2.26 11.22 103.56 61.72 74.57 0.10 97.98
dm-3 23.80 9639.20 0.99 696.82 0.00 0.00 0.00 0.00 1.98 11.54 106.49 42.80 74.03 0.10 93.68
dm-3 10.00 7184.60 0.60 1147.56 0.00 0.00 0.00 0.00 12.84 15.32 106.58 61.36 163.56 0.09 68.30
dm-3 35.00 6771.40 1.69 1293.37 0.00 0.00 0.00 0.00 17.44 18.06 119.48 49.58 195.59 0.10 66.46

>
> Signed-off-by: Nir Soffer <nsoffer@redhat.com>
> ---
>  block/file-posix.c | 10 ++++++++++
>  1 file changed, 10 insertions(+)
>
> diff --git a/block/file-posix.c b/block/file-posix.c
> index 3ab8f5a0fa..cd2e409184 100644
> --- a/block/file-posix.c
> +++ b/block/file-posix.c
> @@ -1621,6 +1621,16 @@ static int handle_aiocb_write_zeroes_unmap(void *opaque)
>      /* First try to write zeros and unmap at the same time */
>
>  #ifdef CONFIG_FALLOCATE_PUNCH_HOLE
> +    /*
> +     * The block device fallocate() implementation in the kernel does set
> +     * BLKDEV_ZERO_NOFALLBACK, but it does not guarantee that the operation is
> +     * fast so we can't call this if we have to avoid slow fallbacks.
> +     */
> +    if (aiocb->aio_type & QEMU_AIO_BLKDEV &&
> +        aiocb->aio_type & QEMU_AIO_NO_FALLBACK) {
> +        return -ENOTSUP;
> +    }
> +
>      int ret = do_fallocate(s->fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
>                             aiocb->aio_offset, aiocb->aio_nbytes);
>      if (ret != -ENOTSUP) {
> --
> 2.25.4
>
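For readers less familiar with the call the patch is guarding, here is a
standalone sketch (not qemu code; the file name and sizes are made up for
illustration) of the fallocate() request that handle_aiocb_write_zeroes_unmap()
tries first. On a regular file it deallocates the range while keeping the file
size, so reads of the hole return zeroes; on a block device the same request is
served by the kernel's blkdev_issue_zeroout() path discussed above.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Write 1 MiB of non-zero data to a scratch file, then punch a hole
 * over it with FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE and report
 * the allocated block count before and after.  Returns 0 on success,
 * -1 on any failure (e.g. ENOTSUP on filesystems without hole support).
 */
static int punch_hole_demo(const char *path, long long *blocks_before,
                           long long *blocks_after)
{
    struct stat st;
    char buf[4096];
    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);

    if (fd < 0) {
        return -1;
    }

    /* Write non-zero data so blocks actually get allocated. */
    memset(buf, 0xff, sizeof(buf));
    for (int i = 0; i < 256; i++) {
        if (write(fd, buf, sizeof(buf)) != (ssize_t)sizeof(buf)) {
            close(fd);
            return -1;
        }
    }
    fsync(fd);

    fstat(fd, &st);
    *blocks_before = st.st_blocks;

    /* The request the qemu code issues first; KEEP_SIZE leaves the
     * file length unchanged, the range now reads back as zeroes. */
    if (fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                  0, 256 * 4096) < 0) {
        close(fd);
        return -1;
    }
    fsync(fd);
    fstat(fd, &st);
    *blocks_after = st.st_blocks;

    close(fd);
    unlink(path);
    return 0;
}
```

Where the filesystem cannot punch holes, the fallocate() call fails with
ENOTSUP, which is the case the hunk above lets fall through to the next
strategy via its `ret != -ENOTSUP` check.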
Am 15.06.2020 um 21:32 hat Nir Soffer geschrieben:
> We can zero 2.3 g/s:
>
> # time blkdiscard -z test-lv
>
> real 0m43.902s
> user 0m0.002s
> sys 0m0.130s

> We can write 445m/s:
>
> # dd if=/dev/zero bs=2M count=51200 of=test-lv oflag=direct conv=fsync
> 107374182400 bytes (107 GB, 100 GiB) copied, 241.257 s, 445 MB/s

So using FALLOC_FL_PUNCH_HOLE _is_ faster after all. What might not be
faster is zeroing out the whole device and then overwriting a
considerable part of it again.

I think this means that we shouldn't fail write_zeroes at the file-posix
level even if BDRV_REQ_NO_FALLBACK is given. Instead, qemu-img convert
is where I see a fix.

Certainly qemu-img could be cleverer and zero out more selectively. The
idea of doing a blk_make_zero() first seems to have caused some
problems, though of course its introduction was also justified with
performance, so improving one case might hurt another if we're not
careful.

However, when Peter Lieven introduced this (commit 5a37b60a61c), we
didn't use write_zeroes yet during the regular copy loop (we do since
commit 690c7301600). So chances are that blk_make_zero() doesn't
actually help any more now.

Can you run another test with the patch below? I think it should perform
the same as yours. Eric, Peter, do you think this would have a negative
effect for NBD and/or iscsi?

The other option would be providing an option and making it Someone
Else's Problem.

Kevin


diff --git a/qemu-img.c b/qemu-img.c
index d7e846e607..bdb9f6aa46 100644
--- a/qemu-img.c
+++ b/qemu-img.c
@@ -2084,15 +2084,6 @@ static int convert_do_copy(ImgConvertState *s)
         s->has_zero_init = bdrv_has_zero_init(blk_bs(s->target));
     }

-    if (!s->has_zero_init && !s->target_has_backing &&
-        bdrv_can_write_zeroes_with_unmap(blk_bs(s->target)))
-    {
-        ret = blk_make_zero(s->target, BDRV_REQ_MAY_UNMAP | BDRV_REQ_NO_FALLBACK);
-        if (ret == 0) {
-            s->has_zero_init = true;
-        }
-    }
-
     /* Allocate buffer for copied data. For compressed images, only one
      * cluster can be copied at a time. */
     if (s->compressed) {
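Kevin's "might not be faster" point can be checked with a back-of-the-envelope
model built only from the rates measured earlier in this thread. The sketch
below is a simplification (it ignores per-request overhead and the read side of
the copy) and is not a claim about qemu internals:

```c
#include <assert.h>
#include <stdio.h>

/*
 * Estimate the two qemu-img convert strategies for the 100 GiB LUN
 * tested above: (1) pre-zero the whole device and then write the data
 * extents, vs (2) zero only the unallocated part and write the data.
 * All rates come from the blkdiscard/dd measurements in this thread.
 */
static void convert_estimates(double *prezero_s, double *selective_s)
{
    const double zero_rate  = 2.3 * 1024;  /* MiB/s, blkdiscard -z      */
    const double write_rate = 445.0;       /* MiB/s, dd oflag=direct    */

    const double image_mib = 100.0 * 1024; /* 100 GiB target device     */
    const double data_mib  = 48.4 * 1024;  /* allocated data in source  */

    /* Strategy 1: zero everything, then overwrite the data extents. */
    *prezero_s = image_mib / zero_rate + data_mib / write_rate;

    /* Strategy 2: zero only the holes, write only the data extents. */
    *selective_s = (image_mib - data_mib) / zero_rate
                   + data_mib / write_rate;
}
```

With these inputs the whole-device pre-zero comes out roughly 21 seconds
slower, in the same ballpark as the ~20 second gap between the two qemu-img
runs measured earlier in the thread.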
On Tue, Jun 16, 2020 at 6:32 PM Kevin Wolf <kwolf@redhat.com> wrote:
>
> Am 15.06.2020 um 21:32 hat Nir Soffer geschrieben:
> > We can zero 2.3 g/s:
> >
> > # time blkdiscard -z test-lv
> >
> > real 0m43.902s
> > user 0m0.002s
> > sys 0m0.130s
>
> > We can write 445m/s:
> >
> > # dd if=/dev/zero bs=2M count=51200 of=test-lv oflag=direct conv=fsync
> > 107374182400 bytes (107 GB, 100 GiB) copied, 241.257 s, 445 MB/s
>
> So using FALLOC_FL_PUNCH_HOLE _is_ faster after all. What might not be
> faster is zeroing out the whole device and then overwriting a
> considerable part of it again.
>
> I think this means that we shouldn't fail write_zeroes at the file-posix
> level even if BDRV_REQ_NO_FALLBACK is given. Instead, qemu-img convert
> is where I see a fix.
>
> Certainly qemu-img could be cleverer and zero out more selectively. The
> idea of doing a blk_make_zero() first seems to have caused some
> problems, though of course its introduction was also justified with
> performance, so improving one case might hurt another if we're not
> careful.
>
> However, when Peter Lieven introduced this (commit 5a37b60a61c), we
> didn't use write_zeroes yet during the regular copy loop (we do since
> commit 690c7301600). So chances are that blk_make_zero() doesn't
> actually help any more now.
>
> Can you run another test with the patch below?

Sure, I can try this.

> I think it should perform
> the same as yours. Eric, Peter, do you think this would have a negative
> effect for NBD and/or iscsi?
>
> The other option would be providing an option and making it Someone
> Else's Problem.
>
> Kevin
>
>
> diff --git a/qemu-img.c b/qemu-img.c
> index d7e846e607..bdb9f6aa46 100644
> --- a/qemu-img.c
> +++ b/qemu-img.c
> @@ -2084,15 +2084,6 @@ static int convert_do_copy(ImgConvertState *s)
>          s->has_zero_init = bdrv_has_zero_init(blk_bs(s->target));
>      }
>
> -    if (!s->has_zero_init && !s->target_has_backing &&
> -        bdrv_can_write_zeroes_with_unmap(blk_bs(s->target)))
> -    {
> -        ret = blk_make_zero(s->target, BDRV_REQ_MAY_UNMAP | BDRV_REQ_NO_FALLBACK);
> -        if (ret == 0) {
> -            s->has_zero_init = true;
> -        }
> -    }

This will work of course, but now we will not do bulk zero for any
target.

I think we never do this for regular files anyway, since we truncate
files and there is nothing to zero, but maybe there is some case when
this is useful?

BTW, do we use BDRV_REQ_NO_FALLBACK elsewhere? Maybe we can remove it.

>      /* Allocate buffer for copied data. For compressed images, only one cluster
>       * can be copied at a time. */
>      if (s->compressed) {
>
On Tue, Jun 16, 2020 at 8:39 PM Nir Soffer <nsoffer@redhat.com> wrote:
>
> On Tue, Jun 16, 2020 at 6:32 PM Kevin Wolf <kwolf@redhat.com> wrote:
> >
> > Am 15.06.2020 um 21:32 hat Nir Soffer geschrieben:
> > > We can zero 2.3 g/s:
> > >
> > > # time blkdiscard -z test-lv
> > >
> > > real 0m43.902s
> > > user 0m0.002s
> > > sys 0m0.130s
> >
> > > We can write 445m/s:
> > >
> > > # dd if=/dev/zero bs=2M count=51200 of=test-lv oflag=direct conv=fsync
> > > 107374182400 bytes (107 GB, 100 GiB) copied, 241.257 s, 445 MB/s
> >
> > So using FALLOC_FL_PUNCH_HOLE _is_ faster after all. What might not be
> > faster is zeroing out the whole device and then overwriting a
> > considerable part of it again.
> >
> > I think this means that we shouldn't fail write_zeroes at the file-posix
> > level even if BDRV_REQ_NO_FALLBACK is given. Instead, qemu-img convert
> > is where I see a fix.
> >
> > Certainly qemu-img could be cleverer and zero out more selectively. The
> > idea of doing a blk_make_zero() first seems to have caused some
> > problems, though of course its introduction was also justified with
> > performance, so improving one case might hurt another if we're not
> > careful.
> >
> > However, when Peter Lieven introduced this (commit 5a37b60a61c), we
> > didn't use write_zeroes yet during the regular copy loop (we do since
> > commit 690c7301600). So chances are that blk_make_zero() doesn't
> > actually help any more now.
> >
> > Can you run another test with the patch below?
>
> Sure, I can try this.

Tried it, and it performs the same as expected.

> > I think it should perform
> > the same as yours. Eric, Peter, do you think this would have a negative
> > effect for NBD and/or iscsi?
> >
> > The other option would be providing an option and making it Someone
> > Else's Problem.
> >
> > Kevin
> >
> >
> > diff --git a/qemu-img.c b/qemu-img.c
> > index d7e846e607..bdb9f6aa46 100644
> > --- a/qemu-img.c
> > +++ b/qemu-img.c
> > @@ -2084,15 +2084,6 @@ static int convert_do_copy(ImgConvertState *s)
> >          s->has_zero_init = bdrv_has_zero_init(blk_bs(s->target));
> >      }
> >
> > -    if (!s->has_zero_init && !s->target_has_backing &&
> > -        bdrv_can_write_zeroes_with_unmap(blk_bs(s->target)))
> > -    {
> > -        ret = blk_make_zero(s->target, BDRV_REQ_MAY_UNMAP | BDRV_REQ_NO_FALLBACK);
> > -        if (ret == 0) {
> > -            s->has_zero_init = true;
> > -        }
> > -    }
>
> This will work of course, but now we will not do bulk zero for any target

I would like to have a minimal change to increase the chance that we
can consume this very soon in oVirt.

> I think we never do this for regular files anyway since we truncate
> files, and there is nothing to zero, but maybe there is some case
> when this is useful?
>
> BTW, do we use BDRV_REQ_NO_FALLBACK elsewhere? maybe we can remove it.
>
> >      /* Allocate buffer for copied data. For compressed images, only one cluster
> >       * can be copied at a time. */
> >      if (s->compressed) {
> >
On 6/16/20 10:32 AM, Kevin Wolf wrote:
> Am 15.06.2020 um 21:32 hat Nir Soffer geschrieben:
>> We can zero 2.3 g/s:
>>
>> # time blkdiscard -z test-lv
>>
>> real 0m43.902s
>> user 0m0.002s
>> sys 0m0.130s
>
>> We can write 445m/s:
>>
>> # dd if=/dev/zero bs=2M count=51200 of=test-lv oflag=direct conv=fsync
>> 107374182400 bytes (107 GB, 100 GiB) copied, 241.257 s, 445 MB/s
>
> So using FALLOC_FL_PUNCH_HOLE _is_ faster after all. What might not be
> faster is zeroing out the whole device and then overwriting a
> considerable part of it again.

Yeah, there can indeed be a difference between a pre-zeroing pass that
is super-fast (on a posix file, truncate to 0 and back to the desired
size, for example) and one that is faster than writes but still slower
than a single pass.

> I think this means that we shouldn't fail write_zeroes at the file-posix
> level even if BDRV_REQ_NO_FALLBACK is given. Instead, qemu-img convert
> is where I see a fix.

Is the kernel able to tell us reliably when we can perform a fast
pre-zero pass? If it can't, it's that much harder to define when
BDRV_REQ_NO_FALLBACK makes a difference.

> Certainly qemu-img could be cleverer and zero out more selectively. The
> idea of doing a blk_make_zero() first seems to have caused some
> problems, though of course its introduction was also justified with
> performance, so improving one case might hurt another if we're not
> careful.
>
> However, when Peter Lieven introduced this (commit 5a37b60a61c), we
> didn't use write_zeroes yet during the regular copy loop (we do since
> commit 690c7301600). So chances are that blk_make_zero() doesn't
> actually help any more now.
>
> Can you run another test with the patch below? I think it should perform
> the same as yours. Eric, Peter, do you think this would have a negative
> effect for NBD and/or iscsi?

I'm still hoping to revive my work on making bdrv_make_zero a
per-driver callback with smarts for the fastest possible pre-zeroing
that driver is capable of, or fast failure when BDRV_REQ_NO_FALLBACK
is set and it is no faster to pre-zero than it is to just write zeroes
when needed. I can certainly construct NBD scenarios in either
direction (where a pre-zeroing pass is faster because of less network
traffic, or where a pre-zeroing pass is slower because of increased
I/O - in fact, that was part of my KVM Forum 2019 demo on why the NBD
protocol added a FAST_ZERO flag mirroring the idea of qemu's
BDRV_REQ_NO_FALLBACK).

> The other option would be providing an option and making it Someone
> Else's Problem.

Matching what we recently did with --target-is-zero.

> Kevin
>
>
> diff --git a/qemu-img.c b/qemu-img.c
> index d7e846e607..bdb9f6aa46 100644
> --- a/qemu-img.c
> +++ b/qemu-img.c
> @@ -2084,15 +2084,6 @@ static int convert_do_copy(ImgConvertState *s)
>          s->has_zero_init = bdrv_has_zero_init(blk_bs(s->target));
>      }
>
> -    if (!s->has_zero_init && !s->target_has_backing &&
> -        bdrv_can_write_zeroes_with_unmap(blk_bs(s->target)))
> -    {
> -        ret = blk_make_zero(s->target, BDRV_REQ_MAY_UNMAP | BDRV_REQ_NO_FALLBACK);
> -        if (ret == 0) {
> -            s->has_zero_init = true;
> -        }
> -    }
> -
>      /* Allocate buffer for copied data. For compressed images, only one cluster
>       * can be copied at a time. */
>      if (s->compressed) {
>
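The "truncate to 0 and back to the desired size" shortcut Eric mentions for
posix files can be sketched outside qemu like this (a standalone illustration,
not qemu code; it only applies where ftruncate() works, i.e. not to the block
devices discussed in this thread):

```c
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Pre-zero a regular file by truncating it to 0 and then back to its
 * original size.  This deallocates every block in O(1) metadata work;
 * the restored range is a hole, so subsequent reads return zeroes.
 */
static int truncate_prezero(int fd)
{
    struct stat st;

    if (fstat(fd, &st) < 0) {
        return -1;
    }
    /* Drop all data, then restore the original length as a hole. */
    if (ftruncate(fd, 0) < 0 || ftruncate(fd, st.st_size) < 0) {
        return -1;
    }
    return 0;
}
```

Note this destroys existing data by design, which is exactly what a pre-zero
pass before a full copy wants, and why nothing comparable exists for a block
device, whose size is fixed.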
Am 16.06.2020 um 22:01 hat Nir Soffer geschrieben:
> On Tue, Jun 16, 2020 at 8:39 PM Nir Soffer <nsoffer@redhat.com> wrote:
> >
> > On Tue, Jun 16, 2020 at 6:32 PM Kevin Wolf <kwolf@redhat.com> wrote:
> > >
> > > Am 15.06.2020 um 21:32 hat Nir Soffer geschrieben:
> > > > We can zero 2.3 g/s:
> > > >
> > > > # time blkdiscard -z test-lv
> > > >
> > > > real 0m43.902s
> > > > user 0m0.002s
> > > > sys 0m0.130s
> > > >
> > > > We can write 445m/s:
> > > >
> > > > # dd if=/dev/zero bs=2M count=51200 of=test-lv oflag=direct conv=fsync
> > > > 107374182400 bytes (107 GB, 100 GiB) copied, 241.257 s, 445 MB/s
> > >
> > > So using FALLOC_FL_PUNCH_HOLE _is_ faster after all. What might not be
> > > faster is zeroing out the whole device and then overwriting a
> > > considerable part of it again.
> > >
> > > I think this means that we shouldn't fail write_zeroes at the file-posix
> > > level even if BDRV_REQ_NO_FALLBACK is given. Instead, qemu-img convert
> > > is where I see a fix.
> > >
> > > Certainly qemu-img could be cleverer and zero out more selectively. The
> > > idea of doing a blk_make_zero() first seems to have caused some
> > > problems, though of course its introduction was also justified with
> > > performance, so improving one case might hurt another if we're not
> > > careful.
> > >
> > > However, when Peter Lieven introduced this (commit 5a37b60a61c), we
> > > didn't use write_zeroes yet during the regular copy loop (we do since
> > > commit 690c7301600). So chances are that blk_make_zero() doesn't
> > > actually help any more now.
> > >
> > > Can you run another test with the patch below?
> >
> > Sure, I can try this.
>
> Tried it, and it performs the same as expected.

Thanks.

> > > I think it should perform
> > > the same as yours. Eric, Peter, do you think this would have a negative
> > > effect for NBD and/or iscsi?
> > >
> > > The other option would be providing an option and making it Someone
> > > Else's Problem.
> > >
> > > Kevin
> > >
> > >
> > > diff --git a/qemu-img.c b/qemu-img.c
> > > index d7e846e607..bdb9f6aa46 100644
> > > --- a/qemu-img.c
> > > +++ b/qemu-img.c
> > > @@ -2084,15 +2084,6 @@ static int convert_do_copy(ImgConvertState *s)
> > >          s->has_zero_init = bdrv_has_zero_init(blk_bs(s->target));
> > >      }
> > >
> > > -    if (!s->has_zero_init && !s->target_has_backing &&
> > > -        bdrv_can_write_zeroes_with_unmap(blk_bs(s->target)))
> > > -    {
> > > -        ret = blk_make_zero(s->target, BDRV_REQ_MAY_UNMAP | BDRV_REQ_NO_FALLBACK);
> > > -        if (ret == 0) {
> > > -            s->has_zero_init = true;
> > > -        }
> > > -    }
> >
> > This will work of course, but now we will not do bulk zero for any target
>
> I would like to have a minimal change to increase the chance that we
> can consume this very soon in oVirt.

I think this one would be pretty minimal.

Maybe we can later bring this code back, but with an implementation of
blk_make_zero() that doesn't use the generic write_zeroes operation,
but with a specific callback like Eric suggested.

> > I think we never do this for regular files anyway since we truncate
> > files, and there is nothing to zero, but maybe there is some case
> > when this is useful?

Yes, regular files have s->has_zero_init == true anyway.

> > BTW, do we use BDRV_REQ_NO_FALLBACK elsewhere? maybe we can remove
> > it.

qcow2 uses it when zeroing out parts of a newly allocated cluster on
partial writes. Though that code is questionable, too, and XFS people
suggest that we should stop using it.

Kevin
diff --git a/block/file-posix.c b/block/file-posix.c
index 3ab8f5a0fa..cd2e409184 100644
--- a/block/file-posix.c
+++ b/block/file-posix.c
@@ -1621,6 +1621,16 @@ static int handle_aiocb_write_zeroes_unmap(void *opaque)
     /* First try to write zeros and unmap at the same time */

 #ifdef CONFIG_FALLOCATE_PUNCH_HOLE
+    /*
+     * The block device fallocate() implementation in the kernel does set
+     * BLKDEV_ZERO_NOFALLBACK, but it does not guarantee that the operation is
+     * fast so we can't call this if we have to avoid slow fallbacks.
+     */
+    if (aiocb->aio_type & QEMU_AIO_BLKDEV &&
+        aiocb->aio_type & QEMU_AIO_NO_FALLBACK) {
+        return -ENOTSUP;
+    }
+
     int ret = do_fallocate(s->fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                            aiocb->aio_offset, aiocb->aio_nbytes);
     if (ret != -ENOTSUP) {
Punching holes on a block device uses blkdev_issue_zeroout() with
BLKDEV_ZERO_NOFALLBACK, but there is no guarantee that this is fast
enough for pre-zeroing an entire device.

Zeroing a block device can be as slow as writing zeroes or 100 times
faster, depending on the storage. There is no way to tell if zeroing is
fast enough. The kernel BLKDEV_ZERO_NOFALLBACK flag does not mean that
the operation is fast; it just means that the kernel will not fall back
to manual zeroing.

Here is an example converting a 10g image with 8g of data to a block
device:

$ ./qemu-img info test.img
image: test.img
file format: raw
virtual size: 10 GiB (10737418240 bytes)
disk size: 8 GiB

$ time ./qemu-img convert -f raw -O raw -t none -T none -W test.img /dev/test/lv1

Before:

real 1m20.483s
user 0m0.490s
sys 0m0.739s

After:

real 0m55.831s
user 0m0.610s
sys 0m0.956s

Signed-off-by: Nir Soffer <nsoffer@redhat.com>
---
 block/file-posix.c | 10 ++++++++++
 1 file changed, 10 insertions(+)